Summarising cross-validation metrics using data.table R

Question

I am struggling to summarise my cross-validation results in R. A grid search has been carried out across two parameters. I am using forward chaining cross-validation so have results across 3 years (2018:2020), and have 100 sub-models that I would like to summarise results across.

I've replicated the structure of my results here:

library(data.table)

# Set random seed for reproducibility
set.seed(42)

# Define unique sub-models, years, and hyperparameters
sub_models <- paste0("model_", sprintf("%03d", 1:100))
years <- 2018:2020
hyperparam_binary <- c(TRUE, FALSE)
hyperparam_numeric <- c(3, 4, 5)

# Create all combinations using data.table's CJ() (cross join)
dummy_data <- CJ(sub_model = sub_models,
                 year = years,
                 hyperparam_binary = hyperparam_binary,
                 hyperparam_numeric = hyperparam_numeric)

# Generate random performance metrics
dummy_data[, MAE := round(runif(.N, min = 1000, max = 5000), 2)]
dummy_data[, RMSE := round(runif(.N, min = 2000, max = 7000), 2)]

I would like to summarise these results by calculating the mean and standard deviation of MAE and RMSE for each combination of hyper-parameters. I tried doing this:

hyperparameter_cols <- c("hyperparam_binary", "hyperparam_numeric")

metric_cols <- c("MAE", "RMSE")

summary_results <- dummy_data[, lapply(.SD, function(x) list(mean = mean(x, na.rm = TRUE), 
                                                             sd = sd(x, na.rm = TRUE))),
                              by = hyperparameter_cols,
                              .SDcols = metric_cols]

But the resulting summary_results table isn't really what I want - it just seems to append mean and sd.

I also want the average rank for each parameter combination, where rank is assigned for each sub-model/year combination. I tried doing this:

ranked_results <- dummy_data[, lapply(.SD, function(x) rank(x, ties.method = "average")), 
                             by = c("sub_model", "year"), 
                             .SDcols = metric_cols]

which seems to work in providing ranks. But it gets rid of the parameter info, so I'm not sure how I can average them.

I feel like there must be a neater solution in data.table. Any help would be much appreciated!

langtang · Accepted Answer

You can try something like this:

f <- \(x) lapply(x, \(i) data.table(m = mean(i), s= sd(i)))

dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols,.SDcols = metric_cols]

Output:

   hyperparam_binary hyperparam_numeric    MAE.m    MAE.s   RMSE.m   RMSE.s
                                            
1:             FALSE                  3 2978.239 1126.644 4563.335 1426.583
2:             FALSE                  4 2925.884 1199.488 4657.478 1543.213
3:             FALSE                  5 2895.942 1165.870 4477.693 1405.294
4:              TRUE                  3 2999.622 1201.470 4550.635 1469.597
5:              TRUE                  4 2988.928 1120.362 4487.522 1520.403
6:              TRUE                  5 2918.992 1194.360 4591.984 1469.488

For your second question about the ranks, you can assign those columns directly to dummy_data using :=. Specifically, do something like this:

rank_cols = paste0(metric_cols, "rank")
dummy_data[
  ,
  c(rank_cols):=lapply(.SD, rank),
  .(sub_model, year),
  .SDcols = metric_cols
]

You could then include these rank_cols when you call f(), or you could work on them separately. An example of the former is:

dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols,.SDcols = c(metric_cols, rank_cols)]

Output:

   hyperparam_binary hyperparam_numeric    MAE.m    MAE.s   RMSE.m   RMSE.s MAErank.m MAErank.s RMSErank.m RMSErank.s
                                                                  
1:             FALSE                  3 2978.239 1126.644 4563.335 1426.583  3.543333  1.640251   3.460000   1.658585
2:             FALSE                  4 2925.884 1199.488 4657.478 1543.213  3.440000  1.750222   3.663333   1.796500
3:             FALSE                  5 2895.942 1165.870 4477.693 1405.294  3.470000  1.743339   3.393333   1.633253
4:              TRUE                  3 2999.622 1201.470 4550.635 1469.597  3.523333  1.760622   3.530000   1.680828
5:              TRUE                  4 2988.928 1120.362 4487.522 1520.403  3.543333  1.640251   3.393333   1.797049
6:              TRUE                  5 2918.992 1194.360 4591.984 1469.488  3.480000  1.722253   3.560000   1.674039

An example of the latter is:

dummy_data[, lapply(.SD, mean), hyperparameter_cols, .SDcols = rank_cols]

Output:

   hyperparam_binary hyperparam_numeric  MAErank RMSErank
                                    
1:             FALSE                  3 3.543333 3.460000
2:             FALSE                  4 3.440000 3.663333
3:             FALSE                  5 3.470000 3.393333
4:              TRUE                  3 3.523333 3.530000
5:              TRUE                  4 3.543333 3.393333
6:              TRUE                  5 3.480000 3.560000

Summarising cross-validation metrics using data.table R

Answers (2)

Related Questions