This page gives an overview of the benchmark results: scores aggregated across the outer resampling iterations, which are used for the later statistical analysis, as well as the individual scores per dataset and model.
In general, results are split by the underlying tuning measure, i.e. harrell_c and rcll, the former being a measure of discrimination and the latter a proper scoring rule.
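For reference, RCLL denotes the right-censored log-loss. Below is a minimal sketch of its per-observation form, assuming the standard definition (negative log predicted density at the observed time for events, negative log predicted survival probability for censored observations); the function name is ours, for illustration only.

# Sketch: right-censored log-loss (RCLL) for a single observation.
# density_at_t: predicted density f(t) at the observed time
# surv_at_t:    predicted survival probability S(t) at the observed time
# event:        1 = event observed, 0 = censored
rcll_single <- function(density_at_t, surv_at_t, event) {
  -log(event * density_at_t + (1 - event) * surv_at_t)
}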
Errors and elapsed time limits
The following tables list the number of errors in the outer resampling iterations per tuning measure (tuned), separately by model and by dataset. These errors were caused by the learner exceeding either the time limit or the memory limit. We resubmitted failing computational jobs with increased memory limits, but in some cases the jobs still failed with more than 100 GB of available memory, at which point we considered the learner/task combination infeasible.
We note:
the affected learners were particularly slow or memory-intensive on large tasks with many observations or many unique time points, with the number of unique time points appearing to be even more relevant than the number of observations.
the tasks listed below are mostly those with many observations and many unique time points.
SSVM is excluded due to persistent technical issues that we could not resolve, giving us no results to evaluate.
We therefore consider the errors to be a result of the learners’ complexity and the tasks’ size.
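For a quick cross-check of which learner/task combinations were affected, the errored outer folds can also be counted directly from the scores object used in the tables below; a sketch, assuming the same columns as in the code that follows:

# Learner/task/tuning-measure combinations with at least one errored outer fold;
# combinations with affected_folds == 5 failed in every outer iteration and
# correspond to the cases we treated as infeasible.
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(learner_id, task_id, tuned, name = "affected_folds") |>
  dplyr::arrange(dplyr::desc(affected_folds))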
Code
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(learner_id, tuned, errors, name = "affected_folds") |>
  tidyr::pivot_wider(
    id_cols = c("learner_id", "errors"),
    names_from = "tuned",
    values_from = "affected_folds",
    values_fill = 0
  ) |>
  dplyr::mutate(total = harrell_c + rcll) |>
  kableExtra::kbl(
    col.names = c("Model", "Error", "Harrell's C", "RCLL", "Total Errors"),
    caption = "Number of errors per outer resampling iteration (up to five), separated by model and tuning measure.",
    booktabs = TRUE,
    format = "latex"
  ) |>
  kableExtra::kable_styling()
Code
scores |>
  dplyr::filter(errors != "") |>
  dplyr::count(task_id, tuned, errors, name = "affected_folds") |>
  tidyr::pivot_wider(
    id_cols = c("task_id", "errors"),
    names_from = "tuned",
    values_from = "affected_folds",
    values_fill = 0
  ) |>
  dplyr::mutate(total = harrell_c + rcll) |>
  kableExtra::kbl(
    col.names = c("Dataset", "Error", "Harrell's C", "RCLL", "Total Errors"),
    caption = "Number of errors per outer resampling iteration (up to five), separated by dataset and tuning measure.",
    booktabs = TRUE,
    format = "latex"
  ) |>
  kableExtra::kable_styling()
Harrell's C
Code
for (measure_id in msr_tbl[(id == "brier_improper" | type == "Discrimination") & !erv, id]) {
  plot_aggr_scores(
    aggr_scores,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "harrell_c",
    dodge = FALSE,
    flip = TRUE
  )
}
Harrell’s C (Scaled)
Code
for (measure_id in msr_tbl[(id == "brier_improper" | type == "Discrimination") & !erv, id]) {
  plot_aggr_scores(
    aggr_scores_scaled,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "harrell_c",
    dodge = FALSE,
    flip = TRUE
  ) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM-Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks scaled such that 0 = KM, 1 = Best model"
    )
}
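The exact [Scaled KM-Best] transformation behind aggr_scores_scaled is not shown on this page; based on the plot subtitles, it is presumably a linear rescaling anchoring the Kaplan-Meier baseline at 0 and the best model per task at 1, roughly as sketched below (scale_km_best is a hypothetical helper, for illustration only).

# Hypothetical illustration of a KM-Best rescaling: 0 = Kaplan-Meier baseline,
# 1 = best model on the given task and measure.
scale_km_best <- function(score, score_km, score_best) {
  (score - score_km) / (score_best - score_km)
}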
RCLL
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(
    aggr_scores,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "rcll",
    dodge = FALSE,
    flip = TRUE
  )
}
RCLL (ERV)
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & erv, id]) {
  plot_aggr_scores(
    aggr_scores,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "rcll",
    dodge = FALSE,
    flip = TRUE
  )
}
RCLL (Scaled)
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(
    aggr_scores_scaled,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "rcll",
    dodge = FALSE,
    flip = TRUE
  ) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM-Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks scaled such that 0 = KM, 1 = Best model"
    )
}
Results per Dataset / scores
These plots show the scores from the individual outer evaluation folds; see scores.[csv|rds].
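To inspect the raw values behind these plots, the fold-level scores can be loaded directly; a minimal sketch, assuming the files are named scores.rds and scores.csv as the reference above suggests:

# Load the fold-level scores (file names assumed from the reference above).
scores <- readRDS("scores.rds")
# alternatively: scores <- read.csv("scores.csv")
str(scores)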
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(
    scores,
    eval_measure_id = measure_id,
    tuning_measure_id = "rcll",
    dodge = FALSE,
    flip = TRUE
  )
}
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(
    scores,
    eval_measure_id = measure_id,
    tuning_measure_id = "harrell_c",
    dodge = FALSE,
    flip = TRUE
  )
}
Calibration
D-Calibration
P-values for D-calibration are calculated as pchisq(score, 10 - 1, lower.tail = FALSE), i.e. against a chi-squared distribution with 9 degrees of freedom.
This is more of a heuristic approach: a non-significant result implies a well-calibrated model, but a significant result does not necessarily imply a poorly calibrated model. Furthermore, no multiplicity correction is applied, given the generally exploratory nature of these plots.
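As a small worked example of this computation (the score value here is hypothetical):

# D-calibration p-value for one (hypothetical) score: chi-squared test with
# 10 - 1 = 9 degrees of freedom, matching the computation described above.
dcalib_score <- 12.3
pchisq(dcalib_score, df = 10 - 1, lower.tail = FALSE)
# roughly 0.2 here, i.e. no evidence against calibration at the 5% level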
Code
for (tuned_on in c("harrell_c", "rcll")) {
  p = aggr_scores |>
    dplyr::filter(tuned == tuned_on) |>
    dplyr::mutate(
      dcalib_p = pchisq(dcalib, 10 - 1, lower.tail = FALSE),
      dcalib_label = fifelse(dcalib_p < 0.05, "X", "")
    ) |>
    ggplot(aes(
      x = forcats::fct_reorder(learner_id, dcalib_p),
      y = forcats::fct_rev(task_id),
      fill = dcalib_p
    )) +
    geom_tile(color = "#EEEEEE") +
    geom_text(aes(label = dcalib_label), color = "white", size = 3) +
    # scale_fill_manual(values = c(`TRUE` = "red", `FALSE` = "blue"), labels = c(`TRUE` = "Signif.", `FALSE` = "Not Signif.")) +
    scale_fill_viridis_c(breaks = seq(0, 1, .1)) +
    guides(
      x = guide_axis(n.dodge = 2),
      fill = guide_colorbar(title.vjust = .8, barwidth = unit(200, "pt"))
    ) +
    labs(
      title = "D-Calibration p-values by task and learner",
      subtitle = glue::glue(
        "Models tuned on {msr_tbl[id == tuned_on, label]}\n",
        "Learners ordered by average p-value. X denotes p < 0.05"
      ),
      y = "Task", x = "Learner", color = NULL, fill = "p-value"
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title.position = "plot",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank(),
      panel.spacing.x = unit(5, "mm"),
      panel.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE")
    )

  print(p)
}
Alpha-Calibration
For this measure, calibration is indicated by a score close to 1.
Code
for (tuned_on in c("harrell_c", "rcll")) {
  p = ggplot(aggr_scores[tuned == tuned_on], aes(y = forcats::fct_rev(learner_id), x = caliba_ratio)) +
    geom_point() +
    geom_vline(xintercept = 1) +
    scale_x_log10() +
    labs(
      title = "Alpha-Calibration by task and learner",
      subtitle = glue::glue(
        "Models tuned on {msr_tbl[id == tuned_on, label]}\n",
        "Values close to 1 indicate reasonable calibration"
      ),
      y = "Learner", x = "Alpha (log10)"
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title.position = "plot",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank()
      # panel.spacing.x = unit(5, "mm"),
      # panel.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE")
    )

  print(p)
}
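As a complement to the plot, results whose alpha-calibration ratio deviates strongly from 1 can also be listed directly; a sketch using the aggr_scores columns from above, with an arbitrary two-fold cutoff chosen purely for illustration:

# Results with an alpha-calibration ratio more than a factor of 2 away from 1.
aggr_scores |>
  dplyr::filter(caliba_ratio > 2 | caliba_ratio < 0.5) |>
  dplyr::select(task_id, learner_id, tuned, caliba_ratio) |>
  dplyr::arrange(caliba_ratio)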
Raw scores
| Variable | Type | Description |
|---|---|---|
| task_id | fct | Dataset name, e.g. veteran |
| learner_id | fct | Model / learner name, e.g. RAN for ranger |
| harrell_c | dbl | Evaluation measure score |
| uno_c | dbl | Evaluation measure score |
| rcll | dbl | Evaluation measure score |
| rcll_erv | dbl | Evaluation measure score |
| logloss | dbl | Evaluation measure score |
| logloss_erv | dbl | Evaluation measure score |
| intlogloss | dbl | Evaluation measure score |
| intlogloss_erv | dbl | Evaluation measure score |
| brier_proper | dbl | Evaluation measure score |
| brier_proper_erv | dbl | Evaluation measure score |
| brier_improper | dbl | Evaluation measure score |
| brier_improper_erv | dbl | Evaluation measure score |
| dcalib | dbl | Evaluation measure score |
| caliba_ratio | dbl | Evaluation measure score |
| caliba_diff | dbl | Evaluation measure score |
| tuned | chr | Tuning measure, one of harrell_c, rcll |
| learner_group | fct | Model / learner group, one of “Baseline”, “Classical”, “Trees”, “Boosting” |
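As a usage example for the columns listed above, the sketch below computes a plain mean of Harrell's C and RCLL per learner and tuning measure across all tasks and folds; this is for illustration only and is not the aggregation used in the analyses above (those rely on aggr_scores).

# Mean Harrell's C and RCLL per learner and tuning measure, computed directly
# from the fold-level scores (illustrative summary only).
scores |>
  dplyr::group_by(learner_id, tuned) |>
  dplyr::summarise(
    mean_harrell_c = mean(harrell_c, na.rm = TRUE),
    mean_rcll = mean(rcll, na.rm = TRUE),
    .groups = "drop"
  )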