Reputation: 1815
I am struggling to summarise my cross-validation results in R. A grid search has been carried out across two parameters. I am using forward chaining cross-validation so have results across 3 years (2018:2020), and have 100 sub-models that I would like to summarise results across.
I've replicated the structure of my results here:
library(data.table)
# Set random seed for reproducibility
set.seed(42)
# Define unique sub-models, years, and hyperparameters
sub_models <- paste0("model_", sprintf("%03d", 1:100))
years <- 2018:2020
hyperparam_binary <- c(TRUE, FALSE)
hyperparam_numeric <- c(3, 4, 5)
# Create all combinations using data.table's CJ() (cross join)
dummy_data <- CJ(sub_model = sub_models,
year = years,
hyperparam_binary = hyperparam_binary,
hyperparam_numeric = hyperparam_numeric)
# Generate random performance metrics
dummy_data[, MAE := round(runif(.N, min = 1000, max = 5000), 2)]
dummy_data[, RMSE := round(runif(.N, min = 2000, max = 7000), 2)]
I would like to summarise these results by calculating the mean and standard deviation of MAE and RMSE for each combination of hyper-parameters. I tried doing this:
hyperparameter_cols <- c("hyperparam_binary", "hyperparam_numeric")
metric_cols <- c("MAE", "RMSE")
summary_results <- dummy_data[, lapply(.SD, function(x) list(mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE))),
by = hyperparameter_cols,
.SDcols = metric_cols]
But the resulting summary_results table isn't really what I want - it just seems to append mean and sd.
I also want the average rank for each parameter combination, where rank is assigned for each sub-model/year combination. I tried doing this:
ranked_results <- dummy_data[, lapply(.SD, function(x) rank(x, ties.method = "average")),
by = c("sub_model", "year"),
.SDcols = metric_cols]
which seems to work in providing ranks. But it gets rid of the parameter info, so I'm not sure how I can average them.
I feel like there must be a neater solution in data.table. Any help would be much appreciated!
Upvotes: 2
Views: 58
Reputation: 6529
You can solve your problem using the data.tabe's function dcast
as shown below:
For the creation rank-related columns in rank_cols
below, I borrowed the code from @langtang's answer.
rank_cols = paste0(metric_cols, "rank")
dummy_data[, (rank_cols) := lapply(.SD, rank), .(sub_model, year), .SDcols=metric_cols]
Summary table:
dcast(data = dummy_data,
formula = hyperparam_binary+hyperparam_numeric ~ .,
fun = list(m=mean, s=sd),
value.var = c(metric_cols, rank_cols))
Key: <hyperparam_binary, hyperparam_numeric>
hyperparam_binary hyperparam_numeric MAE_m RMSE_m MAErank_m RMSErank_m MAE_s RMSE_s MAErank_s RMSErank_s
<lgcl> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: FALSE 3 2978.239 4563.335 3.543333 3.460000 1126.644 1426.583 1.640251 1.658585
2: FALSE 4 2925.884 4657.478 3.440000 3.663333 1199.488 1543.213 1.750222 1.796500
3: FALSE 5 2895.942 4477.693 3.470000 3.393333 1165.870 1405.294 1.743339 1.633253
4: TRUE 3 2999.622 4550.635 3.523333 3.530000 1201.470 1469.597 1.760622 1.680828
5: TRUE 4 2988.928 4487.522 3.543333 3.393333 1120.362 1520.403 1.640251 1.797049
6: TRUE 5 2918.992 4591.984 3.480000 3.560000 1194.360 1469.488 1.722253 1.674039
Upvotes: 1
Reputation: 24830
You can try something like this:
f <- \(x) lapply(x, \(i) data.table(m = mean(i), s= sd(i)))
dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols,.SDcols = metric_cols]
Output:
hyperparam_binary hyperparam_numeric MAE.m MAE.s RMSE.m RMSE.s
<lgcl> <num> <num> <num> <num> <num>
1: FALSE 3 2978.239 1126.644 4563.335 1426.583
2: FALSE 4 2925.884 1199.488 4657.478 1543.213
3: FALSE 5 2895.942 1165.870 4477.693 1405.294
4: TRUE 3 2999.622 1201.470 4550.635 1469.597
5: TRUE 4 2988.928 1120.362 4487.522 1520.403
6: TRUE 5 2918.992 1194.360 4591.984 1469.488
For your second question about the ranks, you can assign those columns directly to dummy_data
using :=
. Specifically, do something like this:
rank_cols = paste0(metric_cols, "rank")
dummy_data[
,
c(rank_cols):=lapply(.SD, rank),
.(sub_model, year),
.SDcols = metric_cols
]
You could then include these rank_cols
when you call f()
, or you could work on them separately. An example of the former is:
dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols,.SDcols = c(metric_cols, rank_cols)]
Output:
hyperparam_binary hyperparam_numeric MAE.m MAE.s RMSE.m RMSE.s MAErank.m MAErank.s RMSErank.m RMSErank.s
<lgcl> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: FALSE 3 2978.239 1126.644 4563.335 1426.583 3.543333 1.640251 3.460000 1.658585
2: FALSE 4 2925.884 1199.488 4657.478 1543.213 3.440000 1.750222 3.663333 1.796500
3: FALSE 5 2895.942 1165.870 4477.693 1405.294 3.470000 1.743339 3.393333 1.633253
4: TRUE 3 2999.622 1201.470 4550.635 1469.597 3.523333 1.760622 3.530000 1.680828
5: TRUE 4 2988.928 1120.362 4487.522 1520.403 3.543333 1.640251 3.393333 1.797049
6: TRUE 5 2918.992 1194.360 4591.984 1469.488 3.480000 1.722253 3.560000 1.674039
An example of the latter is:
dummy_data[, lapply(.SD, mean), hyperparameter_cols, .SDcols = rank_cols]
Output:
hyperparam_binary hyperparam_numeric MAErank RMSErank
<lgcl> <num> <num> <num>
1: FALSE 3 3.543333 3.460000
2: FALSE 4 3.440000 3.663333
3: FALSE 5 3.470000 3.393333
4: TRUE 3 3.523333 3.530000
5: TRUE 4 3.543333 3.393333
6: TRUE 5 3.480000 3.560000
Upvotes: 3