Burk et al. (2024)

On this page

  • Errors and elapsed time limits
  • Aggregated Results
    • Boxplots
  • Results per Dataset / Scores
    • Boxplots
    • Calibration
      • D-Calibration
      • Alpha-Calibration
    • Raw scores
  • Statistical Analysis
    • Global Friedman Test
    • Critical Difference Plots: Bonferroni-Dunn

This page gives an overview of the benchmark results, including scores aggregated across outer resampling iterations (used for the later statistical analysis) and individual scores per dataset and model.

In general, results are divided by the underlying tuning measure, i.e. harrell_c and rcll, where the former is a measure of discrimination and the latter a proper scoring rule.

Errors and elapsed time limits

The following tables list the number of errors in the outer resampling iterations per tuning measure (tuned). These errors were caused by the learner exceeding the time limit or the memory limit. We attempted to resubmit failing computational jobs with increased memory limits, yet in some cases the jobs still failed with more than 100 GB of available memory, at which point we considered the learner/task combination to be infeasible.

We note:

  • the affected learners were particularly slow or memory-intensive on large tasks with many observations or many unique time points, with the number of unique time points appearing even more relevant than the number of observations.
  • the tasks listed below are mostly those with many observations and many unique time points.
  • SSVM is excluded entirely due to persistent technical issues we could not resolve, leaving no results to evaluate.

We therefore consider the errors to be a result of the learners’ complexity and the tasks’ size.

Code
# Count outer resampling folds with recorded errors per learner, error message, and tuning measure
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(learner_id, tuned, errors, name = "affected_folds") |>
  tidyr::pivot_wider(
    id_cols = c("learner_id", "errors"),
    names_from = "tuned",
    values_from = "affected_folds", 
    values_fill = 0
  ) |>
  dplyr::mutate(total = harrell_c + rcll) |>
  kableExtra::kbl(
    col.names = c("Model", "Error", "Harrell's C", "RCLL", "Total Errors"),
    caption = "Number of errors per outer resampling iteration (up to five), separated by model, and tuning measure.",
    booktabs = TRUE,
    format = "latex"
  ) |>
  kableExtra::kable_styling()
Code
# Same error summary, aggregated per dataset instead of per learner
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(task_id, tuned, errors, name = "affected_folds") |>
  tidyr::pivot_wider(
    id_cols = c("task_id", "errors"),
    names_from = "tuned",
    values_from = "affected_folds", 
    values_fill = 0
  ) |>
  dplyr::mutate(total = harrell_c + rcll) |>
  kableExtra::kbl(
    col.names = c("Dataset", "Error", "Harrell's C", "RCLL", "Total Errors"),
    caption = "Number of errors per outer resampling iteration (up to five), separated by dataset, and tuning measure.",
    booktabs = TRUE,
    format = "latex"
  ) |>
  kableExtra::kable_styling()
Detailed table: errors per learner, dataset, and error message
Code
# Detailed breakdown of errors per learner, dataset, and tuning measure
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(learner_id, task_id, tuned, errors, name = "affected_folds") |>
  reactable::reactable(pagination = FALSE, filterable = TRUE, sortable = TRUE)

Aggregated Results

Scores averaged across the outer resampling folds for each task and learner; these aggregated values also form the basis of the statistical analysis below.
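
For orientation, the aggregation step could look like the following minimal sketch, assuming scores holds one row per outer fold with task_id, learner_id, tuned, and numeric measure columns; the benchmark's own aggr_scores object is produced upstream and may differ in detail.

Code
# Hedged sketch: average fold-level scores per task, learner, and tuning measure.
aggr_sketch = scores |>
  dplyr::group_by(task_id, learner_id, tuned) |>
  dplyr::summarise(
    dplyr::across(dplyr::where(is.numeric), \(x) mean(x, na.rm = TRUE)),
    .groups = "drop"
  )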

Boxplots

Harrell’s C
Code
# Boxplots of aggregated scores for discrimination measures (plus the improper Brier score), models tuned on Harrell's C
for (measure_id in msr_tbl[(id == "brier_improper" | type == "Discrimination") & !erv, id]) {
  plot_aggr_scores(aggr_scores, type = "box", eval_measure_id = measure_id, tuning_measure_id = "harrell_c", dodge = FALSE, flip = TRUE)
}

Harrell’s C (Scaled)

Code
for (measure_id in msr_tbl[(id == "brier_improper" | type == "Discrimination") & !erv, id]) {
  plot_aggr_scores(aggr_scores_scaled, type = "box", eval_measure_id = measure_id, tuning_measure_id = "harrell_c", dodge = FALSE, flip = TRUE) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM-Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks scaled such that 0 = KM, 1 = Best model"
    )
}
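
The exact scaling code is not shown on this page; as an illustration only (the learner id "KM" and the direction handling are assumptions), a per-task rescaling with the Kaplan-Meier baseline at 0 and the best model at 1 could be sketched as:

Code
# Hedged sketch: rescale scores within a task so that 0 = Kaplan-Meier
# baseline and 1 = best-performing model on that task.
scale_km_best = function(score, learner_id, higher_better = TRUE) {
  km = score[learner_id == "KM"]              # assumed id of the KM baseline
  best = if (higher_better) max(score) else min(score)
  (score - km) / (best - km)
}

Applied within each task (e.g. after dplyr::group_by(task_id)), this puts all models on a common scale across tasks.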

RCLL

Code
# Boxplots of aggregated scores for proper scoring rules, models tuned on RCLL
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(aggr_scores, type = "box", eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE)
}

RCLL (ERV)

Code
for (measure_id in msr_tbl[type == "Scoring Rule" & erv, id]) {
  plot_aggr_scores(aggr_scores, type = "box", eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE)
}
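
To our understanding, the ERV ("explained residual variation") variants express a scoring rule relative to the Kaplan-Meier baseline, so that 0 corresponds to KM and positive values indicate improvement over it; a hedged sketch of that idea (an assumption, not the implementation used here):

Code
# Hedged sketch: ERV-style standardization of a loss against a KM baseline.
erv_sketch = function(loss_model, loss_km) {
  1 - loss_model / loss_km
}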

RCLL (Scaled)

Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(aggr_scores_scaled, type = "box", eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM-Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks scaled such that 0 = KM, 1 = Best model"
    )
}

Results per Dataset / Scores

These plots use the individual scores from the outer evaluation folds; see scores.[csv|rds].
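
To reproduce these plots, the fold-level scores can be loaded from either file, for example (assuming the files are in the working directory):

Code
# Hedged sketch: load fold-level scores from the published artifacts.
scores = readRDS("scores.rds")
# or from the CSV export:
# scores = read.csv("scores.csv")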

Boxplots

RCLL
Code
# Per-dataset boxplots of fold-level scores for proper scoring rules, models tuned on RCLL
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(scores, eval_measure_id = measure_id, tuning_measure_id = "rcll", dodge = FALSE, flip = TRUE)
}

Harrell’s C

Code
# Per-dataset boxplots of fold-level scores for proper scoring rules, models tuned on Harrell's C
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(scores, eval_measure_id = measure_id, tuning_measure_id = "harrell_c", dodge = FALSE, flip = TRUE)
}

Calibration

D-Calibration

p-values for D-Calibration are calculated as pchisq(score, 10 - 1, lower.tail = FALSE), i.e. the upper tail of a chi-squared distribution with 9 degrees of freedom, corresponding to the 10 bins underlying the D-calibration statistic.

This is more of a heuristic approach: a significant result indicates a poorly calibrated model, whereas an insignificant result does not necessarily imply a well-calibrated one. Furthermore, no multiplicity correction is applied, given the generally exploratory nature of these plots.
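
As a worked example with a made-up D-calibration statistic of 15.2:

Code
# Hypothetical D-calibration statistic; 10 - 1 = 9 degrees of freedom
# correspond to the 10 bins of the statistic.
pchisq(15.2, df = 10 - 1, lower.tail = FALSE)
# approximately 0.085, i.e. not significant at the 5% level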

Code
for (tuned_on in c("harrell_c", "rcll")) {
  p = aggr_scores |>
    dplyr::filter(tuned == tuned_on) |>
    dplyr::mutate(
      # Upper-tail chi-squared p-value with 10 - 1 degrees of freedom
      dcalib_p = pchisq(dcalib, 10 - 1, lower.tail = FALSE),
      # Mark combinations that are significant at the 5% level
      dcalib_label = fifelse(dcalib_p < 0.05, "X", "")
    ) |>
    ggplot(aes(
      x = forcats::fct_reorder(learner_id, dcalib_p), 
      y = forcats::fct_rev(task_id), 
      fill = dcalib_p)
    ) +
    geom_tile(color = "#EEEEEE") +
    geom_text(aes(label = dcalib_label), color = "white", size = 3) +
    # scale_fill_manual(values = c(`TRUE` = "red", `FALSE` = "blue"), labels = c(`TRUE` = "Signif.", `FALSE` = "Not Signif.")) +
    scale_fill_viridis_c(breaks = seq(0, 1, .1)) +
    guides(
      x = guide_axis(n.dodge = 2), 
      fill = guide_colorbar(
        title.vjust = .8,
        barwidth = unit(200, "pt")
    )) +
    labs(
      title = "D-Calibration p-values by task and learner",
      subtitle = glue::glue(
        "Models tuned on {msr_tbl[id == tuned_on, label]}\n",
        "Learners ordered by average p-value. X denotes p < 0.05"
      ),
      y = "Task", x = "Learner", color = NULL, fill = "p-value"
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title.position = "plot",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank(),
      panel.spacing.x = unit(5, "mm"),
      panel.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE")
    )
  print(p)
}

Alpha-Calibration

For this measure, calibration is indicated by a score close to 1.
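
The ratio here is van Houwelingen's calibration alpha; as a rough sketch of the underlying idea (an assumption about the computation, not the benchmark's code), it compares the number of observed events to the sum of the model's predicted cumulative hazards at the observed times:

Code
# Hedged sketch of a calibration-alpha ratio: observed events divided by the
# sum of predicted cumulative hazards evaluated at each subject's observed time.
calib_alpha_sketch = function(status, pred_cumhaz) {
  sum(status) / sum(pred_cumhaz)
}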

Code
for (tuned_on in c("harrell_c", "rcll")) {
  p = ggplot(aggr_scores[tuned == tuned_on], aes(y = forcats::fct_rev(learner_id), x = caliba_ratio)) +
    geom_point() +
    # Reference line at 1, the value indicating perfect calibration
    geom_vline(xintercept = 1) +
    scale_x_log10() +
    labs(
      title = "Alpha-Calibration by task and learner",
      subtitle = glue::glue(
        "Models tuned on {msr_tbl[id == tuned_on, label]}\n",
        "Values close to 1 indicate reasonable calibration"
      ),
      y = "Learner", x = "Alpha (log10)"
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title.position = "plot",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank()
      # panel.spacing.x = unit(5, "mm"),
      # panel.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE")
    )

  print(p)

}

Raw scores

Variable Type Description
task_id fct Dataset name, e.g. veteran
learner_id fct Model / learner name, e.g. RAN for ranger
harrell_c dbl Evaluation measure score
uno_c dbl Evaluation measure score
rcll dbl Evaluation measure score
rcll_erv dbl Evaluation measure score
logloss dbl Evaluation measure score
logloss_erv dbl Evaluation measure score
intlogloss dbl Evaluation measure score
intlogloss_erv dbl Evaluation measure score
brier_proper dbl Evaluation measure score
brier_proper_erv dbl Evaluation measure score
brier_improper dbl Evaluation measure score
brier_improper_erv dbl Evaluation measure score
dcalib dbl Evaluation measure score
caliba_ratio dbl Evaluation measure score
caliba_diff dbl Evaluation measure score
tuned chr Tuning measure, one of harrell_c, rcll
learner_group fct Model / learner group, one of “Baseline”, “Classical”, “Trees”, “Boosting”
Code
aggr_scores |>
  dplyr::mutate(dplyr::across(dplyr::where(is.numeric), \(x) round(x, 3))) |>
  dplyr::arrange(task_id, learner_id) |>
  reactable::reactable(
    sortable = TRUE, filterable = TRUE, searchable = TRUE, defaultPageSize = 30
  )

Statistical Analysis

Global Friedman Test

Harrell’s C
Code
bma_harrell_c$friedman_test(p.adjust.method = "holm") |>
  tablify()
Measure X2 df p.value p.adj.value p.signif
harrell_c 254.6476 16 5.758e-45 0 ***
uno_c 237.3772 16 1.989708e-41 0 ***
rcll 356.9389 16 3.693991e-66 0 ***
rcll_erv 355.3038 16 8.103344e-66 0 ***
logloss 352.4314 16 3.220167e-65 0 ***
logloss_erv 351.8368 16 4.284423e-65 0 ***
intlogloss 208.5205 16 1.494946e-35 0 ***
intlogloss_erv 204.2794 16 1.08075e-34 0 ***
brier_proper 208.2553 16 1.691888e-35 0 ***
brier_proper_erv 208.0123 16 1.89511e-35 0 ***
brier_improper 346.8794 16 4.628926e-64 0 ***
brier_improper_erv 345.922 16 7.328869e-64 0 ***
dcalib 282.1292 16 1.264306e-50 0 ***
caliba_ratio 238.8916 16 9.752463e-42 0 ***
caliba_diff 285.2068 16 2.926213e-51 0 ***
RCLL

Code
bma_rcll$friedman_test(p.adjust.method = "holm") |>
  tablify()
Measure X2 df p.value p.adj.value p.signif
harrell_c 304.5595 16 2.897978e-55 0 ***
uno_c 297.5813 16 8.080138e-54 0 ***
rcll 315.9809 16 1.239219e-57 0 ***
rcll_erv 315.3166 16 1.702333e-57 0 ***
logloss 324.3623 16 2.250208e-59 0 ***
logloss_erv 325.0714 16 1.602662e-59 0 ***
intlogloss 227.1346 16 2.455089e-39 0 ***
intlogloss_erv 223.3283 16 1.46448e-38 0 ***
brier_proper 217.539 16 2.20664e-37 0 ***
brier_proper_erv 220.9343 16 4.49855e-38 0 ***
brier_improper 333.1472 16 3.352373e-61 0 ***
brier_improper_erv 332.334 16 4.949315e-61 0 ***
dcalib 228.427 16 1.338183e-39 0 ***
caliba_ratio 255.2287 16 4.374825e-45 0 ***
caliba_diff 295.252 16 2.451829e-53 0 ***
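
The friedman_test() method above belongs to the benchmark aggregation objects; conceptually it is a standard Friedman test across learners with datasets as blocks. A minimal base-R sketch with made-up scores (learner and task names are illustrative):

Code
# Hedged sketch: Friedman test of one measure across learners, blocked by task.
score_matrix = matrix(
  c(0.68, 0.71, 0.70,
    0.62, 0.66, 0.65,
    0.74, 0.78, 0.77),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("task", 1:3), c("CPH", "RAN", "XGB"))
)
friedman.test(score_matrix)  # rows = blocks (tasks), columns = groups (learners)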

Critical Difference Plots: Bonferroni-Dunn

Using the Cox model (CPH) as the baseline for comparison, these plots represent the primary result of the benchmark.
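
As background, the Bonferroni-Dunn procedure compares each learner's mean rank against the baseline's mean rank and flags a difference as significant when it exceeds a critical difference. A minimal sketch following Demšar (2006), where k is the number of learners and N the number of datasets:

Code
# Hedged sketch of the Bonferroni-Dunn critical difference (Demšar, 2006):
# mean-rank differences to the baseline larger than this value are significant.
bonferroni_dunn_cd = function(k, N, alpha = 0.05) {
  q_alpha = qnorm(1 - alpha / (2 * (k - 1)))  # Bonferroni-adjusted two-sided z quantile
  q_alpha * sqrt(k * (k + 1) / (6 * N))
}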

Harrell’s C
Code
cd_ratio = 10/12

plot_results(bma = bma_harrell_c, type = "cd_bd", measure_id = "harrell_c", tuning_measure_id = "harrell_c", ratio = cd_ratio, baseline = "CPH")

Code
cd_ratio = 10/12

plot_results(bma = bma_harrell_c, type = "cd_bd", measure_id = "brier_improper", tuning_measure_id = "harrell_c", ratio = cd_ratio, baseline = "CPH")

RCLL

Code
plot_results(bma = bma_rcll, type = "cd_bd", measure_id = "rcll", tuning_measure_id = "rcll", ratio = cd_ratio, baseline = "CPH")

Code
plot_results(bma = bma_rcll, type = "cd_bd", measure_id = "brier_improper", tuning_measure_id = "rcll", ratio = cd_ratio, baseline = "CPH")
