This page presents aggregated benchmark results, including statistical tests and critical difference plots. Results are divided by tuning measure: harrell_c (discrimination) and isbs (proper scoring rule).
Aggregated Results
Averaged scores across outer resampling folds for each task and learner.
Discrimination
Overall Performance (ISBS)
Calibration
D-Calibration
Calculating p-values for D-Calibration as pchisq(score, 10 - 1, lower.tail = FALSE).
This represents more of a heuristic approach as an insignificant result implies a well-calibrated model, but a significant result does not necessarily imply a poorly calibrated model. Furthermore, there is no multiplicity correction applied due to the generally exploratory nature of the plots.
Alpha-Calibration
For this measure, calibration is indicated by a score close to 1. The red vertical line marks perfect calibration (alpha = 1).
Statistical Analysis
Global Friedman Test
| X2 | df | p.value | p.adj.value | p.signif | |
|---|---|---|---|---|---|
| harrell_c | 329.508 | 20 | 7.315587e-58 | 0 | *** |
| uno_c | 318.474 | 20 | 1.34273e-55 | 0 | *** |
| X2 | df | p.value | p.adj.value | p.signif | |
|---|---|---|---|---|---|
| isll | 297.9222 | 16 | 6.868346e-54 | 0 | *** |
| isll_erv | 299.362 | 16 | 3.45742e-54 | 0 | *** |
| isbs | 301.7602 | 16 | 1.101756e-54 | 0 | *** |
| isbs_erv | 301.8054 | 16 | 1.078233e-54 | 0 | *** |
| dcalib | 195.6561 | 16 | 5.977296e-33 | 0 | *** |
| alpha_calib | 227.3751 | 16 | 2.192937e-39 | 0 | *** |
Critical Difference Plots: Bonferroni-Dunn
Using Cox (CPH) as baseline for comparison, these represent the primary result of the benchmark.
Raw scores
| Column | Type | Description |
|---|---|---|
| Group | fct |
Model / learner group, one of “Baseline”, “Classical”, “Trees”, “Boosting” |
| Learner | fct |
Model / learner name, e.g. RAN for ranger |
| Task | fct |
Dataset name, e.g. veteran |
| Tuning | chr |
Tuning measure, one of harrell_c, isbs, or harrell_c,isbs for untuned learners or “self-tuners” |
harrell_c, uno_c, isll, isll_erv, isbs, isbs_erv, dcalib, alpha_calib |
dbl |
Evaluation measure score |
| warn | int |
Number of warnings encountered during outer resampling |
| err | int |
Number of errors encountered during outer resampling. Errors indicate failure and the prediction of KM was inserted as fallback |

































