This page gives an overview of the benchmark results, covering both scores aggregated across outer resampling iterations (used for the later statistical analysis) and individual scores per dataset and model.
Throughout, results are split by the underlying tuning measure, i.e. harrell_c and isbs, where the former is a measure of discrimination and the latter a proper scoring rule.
Errors and elapsed-time limits
The following table lists the number of errors in the outer resampling iterations per tuning measure. These errors were caused by the learner exceeding either the elapsed-time limit or the memory limit. We resubmitted failing computational jobs with increased memory limits, yet in some cases the jobs still failed with more than 100 GB of available memory, at which point we considered the learner/task combination infeasible.
We note:
- The affected learners were particularly slow or memory-intensive on large tasks with many observations or many unique time points, where the latter in particular appeared even more relevant than the number of observations.
- The affected tasks are most often those with many observations and unique time points (hdfail, child, check_times).
We therefore consider the errors to be a result of the learners' complexity and the tasks' size, given reasonable computational constraints.
Code
err_tbl <- scores |>
  dplyr::group_by(learner_id, task_id, tune_measure) |>
  dplyr::summarise(
    affected_iterations = sum(errors_cnt > 0),
    total_iterations = dplyr::n(),
    .groups = "drop"
  ) |>
  dplyr::filter(affected_iterations > 0) |>
  dplyr::mutate(
    error_rate = round(100 * affected_iterations / total_iterations, 1),
    errors_fmt = glue::glue("{affected_iterations} / {total_iterations} ({error_rate}%)")
  ) |>
  tidyr::pivot_wider(
    id_cols = c("learner_id", "task_id"),
    names_from = "tune_measure",
    values_from = "errors_fmt",
    values_fill = "—"
  )

err_tbl |>
  dplyr::select(-learner_id) |>
  kableExtra::kbl(
    col.names = c("Dataset", "Harrell's C", "ISBS"),
    caption = "Number of evaluations with errors out of the total outer resampling iterations, by tuning measure. (—) indicates no errors during evaluation (though errors may still have occurred during tuning)."
  ) |>
  kableExtra::kable_styling() |>
  kableExtra::pack_rows(index = table(err_tbl$learner_id))
Number of evaluations with errors out of the total outer resampling iterations, by tuning measure. (—) indicates no errors during evaluation (though errors may still have occurred during tuning).
| Learner | Dataset | Harrell's C | ISBS |
|---------|---------|-------------|------|
| AK | CarpenterFdaData | 1 / 30 (3.3%) | — |
| AK | channing | 1 / 30 (3.3%) | 1 / 30 (3.3%) |
| AK | child | 3 / 3 (100%) | 3 / 3 (100%) |
| AK | e1684 | — | 3 / 30 (10%) |
| AK | hdfail | 3 / 3 (100%) | 3 / 3 (100%) |
| AK | lung | — | 8 / 30 (26.7%) |
| AK | uis | — | 2 / 30 (6.7%) |
| AK | veteran | — | 3 / 30 (10%) |
| CIF | child | 3 / 3 (100%) | 3 / 3 (100%) |
| CIF | hdfail | 3 / 3 (100%) | 3 / 3 (100%) |
| Flex | aids.id | 10 / 30 (33.3%) | — |
| Flex | check_times | 3 / 3 (100%) | 3 / 3 (100%) |
| Flex | child | 3 / 3 (100%) | 3 / 3 (100%) |
| Flex | dataFTR | — | 2 / 30 (6.7%) |
| Flex | hdfail | 3 / 3 (100%) | 3 / 3 (100%) |
| Flex | lung | — | 9 / 30 (30%) |
| Flex | nafld1 | 14 / 15 (93.3%) | 14 / 15 (93.3%) |
| Flex | nwtco | 15 / 15 (100%) | 15 / 15 (100%) |
| Flex | support | 3 / 3 (100%) | 3 / 3 (100%) |
| Flex | wa_churn | 15 / 15 (100%) | 15 / 15 (100%) |
| GLMN | bladder0 | — | 1 / 30 (3.3%) |
| GLMN | channing | 1 / 30 (3.3%) | — |
| GLMN | check_times | — | 2 / 3 (66.7%) |
| GLMN | cost | — | 12 / 30 (40%) |
| GLMN | dataSTR | — | 2 / 30 (6.7%) |
| GLMN | hdfail | 3 / 3 (100%) | — |
| GLMN | std | — | 6 / 30 (20%) |
| GLMN | uis | — | 4 / 30 (13.3%) |
| GLMN | veteran | 14 / 30 (46.7%) | — |
| GLMN | wbc1 | 4 / 30 (13.3%) | — |
| MBSTAFT | hdfail | — | 2 / 3 (66.7%) |
| MBSTCox | child | 3 / 3 (100%) | 3 / 3 (100%) |
| MBSTCox | dataSTR | 1 / 30 (3.3%) | — |
| MBSTCox | hdfail | 3 / 3 (100%) | 3 / 3 (100%) |
| ORSF | child | 3 / 3 (100%) | 3 / 3 (100%) |
| ORSF | cost | — | 1 / 30 (3.3%) |
| ORSF | gbsg | — | 1 / 15 (6.7%) |
| ORSF | hdfail | 3 / 3 (100%) | 3 / 3 (100%) |
| ORSF | nafld1 | 9 / 15 (60%) | 1 / 15 (6.7%) |
| ORSF | uis | 1 / 30 (3.3%) | — |
| ORSF | veteran | — | 1 / 30 (3.3%) |
| Pen | aids.id | 9 / 30 (30%) | 1 / 30 (3.3%) |
| Pen | bladder0 | — | 8 / 30 (26.7%) |
| Pen | channing | — | 1 / 30 (3.3%) |
| Pen | check_times | 3 / 3 (100%) | 3 / 3 (100%) |
| Pen | cost | — | 3 / 30 (10%) |
| Pen | dataSTR | 3 / 30 (10%) | 11 / 30 (36.7%) |
| Pen | hdfail | 2 / 3 (66.7%) | — |
| RAN | check_times | 2 / 3 (66.7%) | 3 / 3 (100%) |
| RAN | child | 3 / 3 (100%) | 3 / 3 (100%) |
| RAN | cost | — | 1 / 30 (3.3%) |
| RAN | hdfail | 1 / 3 (33.3%) | 1 / 3 (33.3%) |
| RAN | mgus | 2 / 30 (6.7%) | — |
| RAN | nafld1 | 9 / 15 (60%) | 4 / 15 (26.7%) |
| RFSRC | check_times | 3 / 3 (100%) | 3 / 3 (100%) |
| RFSRC | child | 3 / 3 (100%) | 3 / 3 (100%) |
| RFSRC | colrec | 2 / 3 (66.7%) | 1 / 3 (33.3%) |
| RFSRC | nafld1 | 1 / 15 (6.7%) | 2 / 15 (13.3%) |
| RFSRC | support | 3 / 3 (100%) | 2 / 3 (66.7%) |
| RRT | dataFTR | — | 5 / 30 (16.7%) |
| RRT | lung | — | 8 / 30 (26.7%) |
| RRT | metabric | — | 7 / 15 (46.7%) |
| RRT | nwtco | — | 7 / 15 (46.7%) |
| RRT | ova | — | 3 / 30 (10%) |
| RRT | tumor | — | 3 / 30 (10%) |
| SSVM | check_times | — | 3 / 3 (100%) |
| SSVM | child | — | 3 / 3 (100%) |
| SSVM | colrec | — | 3 / 3 (100%) |
| SSVM | flchain | — | 11 / 15 (73.3%) |
| SSVM | hdfail | — | 3 / 3 (100%) |
| SSVM | nafld1 | — | 15 / 15 (100%) |
| SSVM | nwtco | — | 8 / 15 (53.3%) |
| SSVM | ova | — | 3 / 30 (10%) |
| SSVM | support | — | 3 / 3 (100%) |
| SSVM | wa_churn | — | 15 / 15 (100%) |
Aggregated Results
Averaged scores across outer resampling folds for each task and learner.
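The aggr_scores object plotted below is constructed elsewhere in the pipeline. As a minimal sketch of what that aggregation could look like, assuming the same scores columns as in the error table above and one numeric column per evaluation measure:

# Sketch only (not the benchmark's actual aggregation code): average each
# numeric measure across outer resampling folds, keeping one row per
# learner/task/tuning-measure combination.
aggr_scores <- scores |>
  dplyr::group_by(learner_id, task_id, tune_measure) |>
  dplyr::summarise(
    dplyr::across(where(is.numeric), mean),
    .groups = "drop"
  )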
Harrell's C
Code
for (measure_id in msr_tbl[type == "Discrimination", id]) {
  plot_aggr_scores(
    aggr_scores,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "harrell_c",
    dodge = FALSE,
    flip = TRUE
  )
}
Harrell’s C (Scaled)
Code
for (measure_id in msr_tbl[type == "Discrimination", id]) {
  plot_aggr_scores(
    aggr_scores_scaled,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "harrell_c",
    dodge = FALSE,
    flip = TRUE
  ) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM – Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks, scaled such that 0 = KM and 1 = best model"
    )
}
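The [Scaled KM – Best] transformation is only described in the plot subtitles above. The following is a hedged sketch of such a scaling; the function name scale_km_best and its inputs are hypothetical, not the benchmark's actual implementation:

# Sketch only: map a score so that the Kaplan-Meier baseline becomes 0 and
# the best model on the task becomes 1. The formula works for both score
# orientations as long as `best` is the most favorable score on the task.
scale_km_best <- function(score, km, best) {
  (score - km) / (best - km)
}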
ISBS
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(
    aggr_scores,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "isbs",
    dodge = FALSE,
    flip = TRUE
  )
}
ISBS (ERV)
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & erv, id]) {
  plot_aggr_scores(
    aggr_scores,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "isbs",
    dodge = FALSE,
    flip = TRUE
  )
}
ISBS (Scaled)
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_aggr_scores(
    aggr_scores_scaled,
    type = "box",
    eval_measure_id = measure_id,
    tuning_measure_id = "isbs",
    dodge = FALSE,
    flip = TRUE
  ) %+%
    labs(
      title = glue::glue("{msr_tbl[id == measure_id, label]} [Scaled KM – Best]"),
      subtitle = "Boxplot of aggregated scores across all tasks, scaled such that 0 = KM and 1 = best model"
    )
}
Results per Dataset
Scores are taken from the outer evaluation folds; see scores.[csv|rds].
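To inspect these results locally, the per-iteration scores can be loaded from either file, for example (assuming the files are in the working directory):

# Load the per-iteration scores; the .rds version preserves R column types,
# the .csv version is a plain-text alternative.
scores <- readRDS("scores.rds")
# scores <- read.csv("scores.csv")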
Code
for (measure_id in msr_tbl[type == "Discrimination", id]) {
  plot_scores(
    scores,
    eval_measure_id = measure_id,
    tuning_measure_id = "harrell_c",
    dodge = FALSE,
    flip = TRUE
  )
}
Code
for (measure_id in msr_tbl[type == "Scoring Rule" & !erv, id]) {
  plot_scores(
    scores,
    eval_measure_id = measure_id,
    tuning_measure_id = "isbs",
    dodge = FALSE,
    flip = TRUE
  )
}
Calibration
D-Calibration
Calculating p-values for D-Calibration as pchisq(score, 10 - 1, lower.tail = FALSE), i.e. testing against a chi-squared distribution with 10 - 1 = 9 degrees of freedom (one fewer than the 10 quantile bins of the statistic).
This is more of a heuristic than a formal test: a non-significant result suggests a well-calibrated model, whereas a significant result does not necessarily imply a poorly calibrated model. Furthermore, no multiplicity correction is applied, given the generally exploratory nature of these plots.
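As a minimal sketch, assuming the D-Calibration statistic is stored in a column named dcalib (a hypothetical name) in the per-iteration scores:

# Sketch only: "dcalib" is a hypothetical column holding the D-Calibration
# chi-squared statistic; 10 - 1 = 9 degrees of freedom for 10 quantile bins.
# Larger statistics yield smaller p-values.
dcal_pvalues <- scores |>
  dplyr::mutate(p_value = pchisq(dcalib, df = 10 - 1, lower.tail = FALSE))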