Burk et al. (2024)

On this page

  • Errors and elapsed time limits
  • Aggregated Results
    • Boxplots
  • Results per Dataset / Scores
    • Boxplots
    • Calibration
      • D-Calibration
      • Alpha-Calibration
    • Raw scores
  • Statistical Analysis
    • Global Friedman Test
    • Critical Difference Plots: Bonferroni-Dunn

This page gives an overview of the benchmark results, including scores aggregated across outer resampling iterations (used for the later statistical analysis) and individual scores per dataset and model.

In general, results are divided by the underlying tuning measure, i.e. harrell_c and rcll, where the former is a measure of discrimination and the latter a proper scoring rule.

Errors and elapsed time limits

The following tables list the number of errors in the outer resampling iterations per tuning measure (tuned). These errors were caused by the learner exceeding the time limit or the memory limit. We attempted to resubmit failing computational jobs with increased memory limits, yet in some cases the jobs still failed with more than 100 GB of available memory, at which point we considered the learner/task combination to be infeasible.

We note:

  • the affected learners were particularly slow or memory-intensive on large tasks with many observations or many unique time points, with the number of unique time points appearing even more relevant than the number of observations.
  • the tasks listed below are mostly those with many observations and many unique time points.
  • SSVM is excluded entirely due to persistent technical issues we could not resolve, leaving no results to evaluate.

We therefore consider the errors to be a result of the learners’ complexity and the tasks’ size.

Code
# Count outer resampling folds with recorded errors per learner, error message, and tuning measure
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(learner_id, tuned, errors, name = "affected_folds") |>
  tidyr::pivot_wider(
    id_cols = c("learner_id", "errors"),
    names_from = "tuned",
    values_from = "affected_folds", 
    values_fill = 0
  ) |>
  dplyr::mutate(total = harrell_c + rcll) |>
  kableExtra::kbl(
    col.names = c("Model", "Error", "Harrell's C", "RCLL", "Total Errors"),
    caption = "Number of errors per outer resampling iteration (up to five), separated by model, and tuning measure.",
    booktabs = TRUE,
    format = "latex"
  ) |>
  kableExtra::kable_styling()
Code
# Same error summary, aggregated per dataset instead of per learner
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(task_id, tuned, errors, name = "affected_folds") |>
  tidyr::pivot_wider(
    id_cols = c("task_id", "errors"),
    names_from = "tuned",
    values_from = "affected_folds", 
    values_fill = 0
  ) |>
  dplyr::mutate(total = harrell_c + rcll) |>
  kableExtra::kbl(
    col.names = c("Dataset", "Error", "Harrell's C", "RCLL", "Total Errors"),
    caption = "Number of errors per outer resampling iteration (up to five), separated by dataset, and tuning measure.",
    booktabs = TRUE,
    format = "latex"
  ) |>
  kableExtra::kable_styling()
Detailed table: errors per learner, dataset, and error message
Code
# Detailed breakdown of errors per learner, dataset, and tuning measure
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(learner_id, task_id, tuned, errors, name = "affected_folds") |>
  reactable::reactable(pagination = FALSE, filterable = TRUE, sortable = TRUE)

Aggregated Results

Scores averaged across the outer resampling folds for each task and learner; these aggregated values also form the basis of the statistical analysis below.
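
For orientation, the aggregation step could look like the following minimal sketch, assuming scores holds one row per outer fold with task_id, learner_id, tuned, and numeric measure columns; the benchmark's own aggr_scores object is produced upstream and may differ in detail.

Code
# Hedged sketch: average fold-level scores per task, learner, and tuning measure.
aggr_sketch = scores |>
  dplyr::group_by(task_id, learner_id, tuned) |>
  dplyr::summarise(
    dplyr::across(dplyr::where(is.numeric), \(x) mean(x, na.rm = TRUE)),
    .groups = "drop"
  )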

Boxplots

Harrell’s C
Code
# Boxplots of aggregated scores for discrimination measures (plus the improper Brier score), models tuned on Harrell's C
for (measure_id in msr_tbl[(id == "brier_improper" | type == "Discrimination") & !erv, id]) {
  plot_aggr_scores(aggr_scores, type = "box", eval_measure_id = measure_id, tuning_measure_id = "harrell_c", dodge = FALSE, flip = TRUE)
}

Harrell’s C (Scaled)

Code
for (measure_id in msr_tbl[(id == "brier_improper" | type == "Discrimination") & !erv, id]) {
  plot_aggr_scores(aggr_scores_scaled, type = "box", eval_measure_id = measure_id, tuning_measure_id = "harrell_c", dodge = FALSE, flip = TRUE) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM-Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks scaled such that 0 = KM, 1 = Best model"
    )
}
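
The exact scaling code is not shown on this page; as an illustration only (the learner id "KM" and the direction handling are assumptions), a per-task rescaling with the Kaplan-Meier baseline at 0 and the best model at 1 could be sketched as:

Code
# Hedged sketch: rescale scores within a task so that 0 = Kaplan-Meier
# baseline and 1 = best-performing model on that task.
scale_km_best = function(score, learner_id, higher_better = TRUE) {
  km = score[learner_id == "KM"]              # assumed id of the KM baseline
  best = if (higher_better) max(score) else min(score)
  (score - km) / (best - km)
}

Applied within each task (e.g. after dplyr::group_by(task_id)), this puts all models on a common scale across tasks.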

RCLL

Code
# Boxplots of aggregated scores for proper scoring rules, models tuned on RCLL
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(aggr_scores, type = "box", eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE)
}

RCLL (ERV)

Code
for (measure_id in msr_tbl[type == "Scoring Rule" & erv, id]) {
  plot_aggr_scores(aggr_scores, type = "box", eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE)
}
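
To our understanding, the ERV ("explained residual variation") variants express a scoring rule relative to the Kaplan-Meier baseline, so that 0 corresponds to KM and positive values indicate improvement over it; a hedged sketch of that idea (an assumption, not the implementation used here):

Code
# Hedged sketch: ERV-style standardization of a loss against a KM baseline.
erv_sketch = function(loss_model, loss_km) {
  1 - loss_model / loss_km
}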

RCLL (Scaled)

Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(aggr_scores_scaled, type = "box", eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM-Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks scaled such that 0 = KM, 1 = Best model"
    )
}

Results per Dataset / Scores

These plots use the individual scores from the outer evaluation folds; see scores.[csv|rds].
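
To reproduce these plots, the fold-level scores can be loaded from either file, for example (assuming the files are in the working directory):

Code
# Hedged sketch: load fold-level scores from the published artifacts.
scores = readRDS("scores.rds")
# or from the CSV export:
# scores = read.csv("scores.csv")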

Boxplots

RCLL
Code
# Per-dataset boxplots of fold-level scores for proper scoring rules, models tuned on RCLL
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(scores, eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE)
}

Harrell’s C

Code
# Per-dataset boxplots of fold-level scores for proper scoring rules, models tuned on Harrell's C
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(scores, eval_measure_id = measure_id, tuning_measure_id = "harrell_c", dodge = FALSE, flip = TRUE)
}

Calibration

D-Calibration

p-values for D-Calibration are calculated as pchisq(score, 10 - 1, lower.tail = FALSE), i.e. the upper tail of a chi-squared distribution with 9 degrees of freedom, corresponding to the 10 bins underlying the D-calibration statistic.

This is more of a heuristic approach: a significant result indicates a poorly calibrated model, whereas an insignificant result does not necessarily imply a well-calibrated one. Furthermore, no multiplicity correction is applied, given the generally exploratory nature of these plots.
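
As a worked example with a made-up D-calibration statistic of 15.2:

Code
# Hypothetical D-calibration statistic; 10 - 1 = 9 degrees of freedom
# correspond to the 10 bins of the statistic.
pchisq(15.2, df = 10 - 1, lower.tail = FALSE)
# approximately 0.085, i.e. not significant at the 5% level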

Code
for (tuned_on in c("harrell_c", "rcll")) {
  p = aggr_scores |>
    dplyr::filter(tuned == tuned_on) |>
    dplyr::mutate(
      # Upper-tail chi-squared p-value with 10 - 1 degrees of freedom
      dcalib_p = pchisq(dcalib, 10 - 1, lower.tail = FALSE),
      # Mark combinations that are significant at the 5% level
      dcalib_label = fifelse(dcalib_p < 0.05, "X", "")
    ) |>
    ggplot(aes(
      x = forcats::fct_reorder(learner_id, dcalib_p), 
      y = forcats::fct_rev(task_id), 
      fill = dcalib_p)
    ) +
    geom_tile(color = "#EEEEEE") +
    geom_text(aes(label = dcalib_label), color = "white", size = 3) +
    # scale_fill_manual(values = c(`TRUE` = "red", `FALSE` = "blue"), labels = c(`TRUE` = "Signif.", `FALSE` = "Not Signif.")) +
    scale_fill_viridis_c(breaks = seq(0, 1, .1)) +
    guides(
      x = guide_axis(n.dodge = 2), 
      fill = guide_colorbar(
        title.vjust = .8,
        barwidth = unit(200, "pt")
    )) +
    labs(
      title = "D-Calibration p-values by task and learner",
      subtitle = glue::glue(
        "Models tuned on {msr_tbl[id == tuned_on, label]}\n",
        "Learners ordered by average p-value. X denotes p < 0.05"
      ),
      y = "Task", x = "Learner", color = NULL, fill = "p-value"
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title.position = "plot",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank(),
      panel.spacing.x = unit(5, "mm"),
      panel.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE")
    )
  print(p)
}

Alpha-Calibration

For this measure, calibration is indicated by a score close to 1.
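
The ratio here is van Houwelingen's calibration alpha; as a rough sketch of the underlying idea (an assumption about the computation, not the benchmark's code), it compares the number of observed events to the sum of the model's predicted cumulative hazards at the observed times:

Code
# Hedged sketch of a calibration-alpha ratio: observed events divided by the
# sum of predicted cumulative hazards evaluated at each subject's observed time.
calib_alpha_sketch = function(status, pred_cumhaz) {
  sum(status) / sum(pred_cumhaz)
}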

Code
for (tuned_on in c("harrell_c", "rcll")) {
  p = ggplot(aggr_scores[tuned == tuned_on], aes(y = forcats::fct_rev(learner_id), x = caliba_ratio)) +
    geom_point() +
    # Reference line at 1, the value indicating perfect calibration
    geom_vline(xintercept = 1) +
    scale_x_log10() +
    labs(
      title = "Alpha-Calibration by task and learner",
      subtitle = glue::glue(
        "Models tuned on {msr_tbl[id == tuned_on, label]}\n",
        "Values close to 1 indicate reasonable calibration"
      ),
      y = "Learner", x = "Alpha (log10)"
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title.position = "plot",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank()
      # panel.spacing.x = unit(5, "mm"),
      # panel.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE")
    )

  print(p)

}

Raw scores

Variable Type Description
task_id fct Dataset name, e.g. veteran
learner_id fct Model / learner name, e.g. RAN for ranger
harrell_c dbl Evaluation measure score
uno_c dbl Evaluation measure score
rcll dbl Evaluation measure score
rcll_erv dbl Evaluation measure score
logloss dbl Evaluation measure score
logloss_erv dbl Evaluation measure score
intlogloss dbl Evaluation measure score
intlogloss_erv dbl Evaluation measure score
brier_proper dbl Evaluation measure score
brier_proper_erv dbl Evaluation measure score
brier_improper dbl Evaluation measure score
brier_improper_erv dbl Evaluation measure score
dcalib dbl Evaluation measure score
caliba_ratio dbl Evaluation measure score
caliba_diff dbl Evaluation measure score
tuned chr Tuning measure, one of harrell_c, rcll
learner_group fct Model / learner group, one of “Baseline”, “Classical”, “Trees”, “Boosting”
Code
aggr_scores |>
  dplyr::mutate(dplyr::across(dplyr::where(is.numeric), \(x) round(x, 3))) |>
  dplyr::arrange(task_id, learner_id) |>
  reactable::reactable(
    sortable = TRUE, filterable = TRUE, searchable = TRUE, defaultPageSize = 30
  )

Statistical Analysis

Global Friedman Test

Harrell’s C
Code
bma_harrell_c$friedman_test(p.adjust.method = "holm") |>
  tablify()
Measure X2 df p.value p.adj.value p.signif
harrell_c 254.6476 16 5.758e-45 0 ***
uno_c 237.3772 16 1.989708e-41 0 ***
rcll 356.9389 16 3.693991e-66 0 ***
rcll_erv 355.3038 16 8.103344e-66 0 ***
logloss 352.4314 16 3.220167e-65 0 ***
logloss_erv 351.8368 16 4.284423e-65 0 ***
intlogloss 208.5205 16 1.494946e-35 0 ***
intlogloss_erv 204.2794 16 1.08075e-34 0 ***
brier_proper 208.2553 16 1.691888e-35 0 ***
brier_proper_erv 208.0123 16 1.89511e-35 0 ***
brier_improper 346.8794 16 4.628926e-64 0 ***
brier_improper_erv 345.922 16 7.328869e-64 0 ***
dcalib 282.1292 16 1.264306e-50 0 ***
caliba_ratio 238.8916 16 9.752463e-42 0 ***
caliba_diff 285.2068 16 2.926213e-51 0 ***
RCLL

Code
bma_rcll$friedman_test(p.adjust.method = "holm") |>
  tablify()
Measure X2 df p.value p.adj.value p.signif
harrell_c 304.5595 16 2.897978e-55 0 ***
uno_c 297.5813 16 8.080138e-54 0 ***
rcll 315.9809 16 1.239219e-57 0 ***
rcll_erv 315.3166 16 1.702333e-57 0 ***
logloss 324.3623 16 2.250208e-59 0 ***
logloss_erv 325.0714 16 1.602662e-59 0 ***
intlogloss 227.1346 16 2.455089e-39 0 ***
intlogloss_erv 223.3283 16 1.46448e-38 0 ***
brier_proper 217.539 16 2.20664e-37 0 ***
brier_proper_erv 220.9343 16 4.49855e-38 0 ***
brier_improper 333.1472 16 3.352373e-61 0 ***
brier_improper_erv 332.334 16 4.949315e-61 0 ***
dcalib 228.427 16 1.338183e-39 0 ***
caliba_ratio 255.2287 16 4.374825e-45 0 ***
caliba_diff 295.252 16 2.451829e-53 0 ***
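
The friedman_test() method above belongs to the benchmark aggregation objects; conceptually it is a standard Friedman test across learners with datasets as blocks. A minimal base-R sketch with made-up scores (learner and task names are illustrative):

Code
# Hedged sketch: Friedman test of one measure across learners, blocked by task.
score_matrix = matrix(
  c(0.68, 0.71, 0.70,
    0.62, 0.66, 0.65,
    0.74, 0.78, 0.77),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("task", 1:3), c("CPH", "RAN", "XGB"))
)
friedman.test(score_matrix)  # rows = blocks (tasks), columns = groups (learners)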

Critical Difference Plots: Bonferroni-Dunn

Using the Cox model (CPH) as the baseline for comparison, these plots represent the primary result of the benchmark.
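
As background, the Bonferroni-Dunn procedure compares each learner's mean rank against the baseline's mean rank and flags a difference as significant when it exceeds a critical difference. A minimal sketch following Demšar (2006), where k is the number of learners and N the number of datasets:

Code
# Hedged sketch of the Bonferroni-Dunn critical difference (Demšar, 2006):
# mean-rank differences to the baseline larger than this value are significant.
bonferroni_dunn_cd = function(k, N, alpha = 0.05) {
  q_alpha = qnorm(1 - alpha / (2 * (k - 1)))  # Bonferroni-adjusted two-sided z quantile
  q_alpha * sqrt(k * (k + 1) / (6 * N))
}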

Harrell’s C
Code
cd_ratio = 10/12

plot_results(bma = bma_harrell_c, type = "cd_bd", measure_id = "harrell_c", tuning_measure_id = "harrell_c", ratio = cd_ratio, baseline = "CPH")

Code
cd_ratio = 10/12

plot_results(bma = bma_harrell_c, type = "cd_bd", measure_id = "brier_improper", tuning_measure_id = "harrell_c", ratio = cd_ratio, baseline = "CPH")

RCLL

Code
plot_results(bma = bma_rcll, type = "cd_bd", measure_id = "rcll", tuning_measure_id = "rcll", ratio = cd_ratio, baseline = "CPH")

Code
plot_results(bma = bma_rcll, type = "cd_bd", measure_id = "brier_improper", tuning_measure_id = "rcll", ratio = cd_ratio, baseline = "CPH")
