Reputation: 1926
I am running a simulation study where my results are stored in a nested list structure. The first level of the list represents the different hyperparameters generated by the model. The second level is the number of replications of the same model (changing the seed).
In the example below, I list the output of a model that is governed by two hyperparameters (hyperpar1
and hyperpar2
) where both can take 2 different values, leading to 4 different combinations of the resulting model. Additionally, each of the 4 possible combinations was run twice (different seeds), leading to eight possible combinations (as can be seen below with str(res, max = 2)
). Finally, two performance metrics were recovered from each possible iteration of the models (metric1
and metric2
).
My problem is that in my real data, the number of iterations (second level of the list) is huge (up to 10000), and the full factorial of the number of hyperparameters is up to 2000 in some cases. Therefore the unlisting process becomes rather slow.
Below I list my current procedure with my desired output, but again, it is relatively slow. In particular, there is one part when I unlist everything, and I put it together in one big data.frame that takes way too long, but I haven't solved this in a faster way.
res <-list(
list(list(modeltype = "tree", time_iter = structure(0.7099, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4847, metric2 = 0.2576 ),
list(modeltype = "tree", time_iter = structure(0.058 , class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4013, metric2 = 0.2569 )),
list(list(modeltype = "tree", time_iter = structure(0.046 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.4755, metric2 = 0.2988 ),
list(modeltype = "tree", time_iter = structure(0.0474, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.2413, metric2 = 0.2147 )),
list(list(modeltype = "tree", time_iter = structure(0.0502, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.7131, metric2 = 0.5024 ),
list(modeltype = "tree", time_iter = structure(2.9419, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.4254, metric2 = 0.2824 )),
list(list(modeltype = "tree", time_iter = structure(0.041 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.6709, metric2 = 0.4092 ),
list(modeltype = "tree", time_iter = structure(0.0396, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.4585, metric2 = 0.4115 )))
hyperpar1 <- c(0.5 , 0.8 )
hyperpar2 <- c(0.5 , 1 )
expand.grid(hyperpar1 = hyperpar1, hyperpar2 = hyperpar2)
# hyperpar1 hyperpar2
# 1 0.5 0.5
# 2 0.8 0.5
# 3 0.5 1.0
# 4 0.8 1.0
#List structure:
#The 4 elements represent the 4 combinations of the hyperparams
#Inside each of the 4 combinations of the hyperparams, 2 lists represent the 2 simulations (with different seeds)
str(res, max = 1)
#Finally, inside each of the final level (level=3) there is a list of 10 objects that are the results of each simulation
str(res, max = 2)
# List of 4
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
#e.g Fist iteration of first model
t(res[[1]][[1]])
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# [1,] "tree" 0.7099 1 75 0.5 0.5 0.4847 0.2576
In the following code, I unnest the list and put everything into a data.frame.
#Unlist the nested structure of the list `res`
all_in_list <- lapply(1:length(res), function(i) {
unlisting <- unlist(res[i],recursive = FALSE)
to_df <- do.call(rbind, lapply(unlisting, as.data.frame))
return(to_df)})
#Here is where averything get really really slow when the list is huge
all_in_df <- do.call(rbind, lapply(all_in_list, as.data.frame))
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# 1 tree 0.7099 secs 1 75 0.5 0.5 0.4847 0.2576
# 2 tree 0.0580 secs 2 75 0.5 0.5 0.4013 0.2569
# 3 tree 0.0460 secs 1 75 0.8 0.5 0.4755 0.2988
# 4 tree 0.0474 secs 2 75 0.8 0.5 0.2413 0.2147
# 5 tree 0.0502 secs 1 75 0.5 1.0 0.7131 0.5024
# 6 tree 2.9419 secs 2 75 0.5 1.0 0.4254 0.2824
# 7 tree 0.0410 secs 1 75 0.8 1.0 0.6709 0.4092
# 8 tree 0.0396 secs 2 75 0.8 1.0 0.4585 0.4115
In the following, I recovered the mean and the standard deviation of the performance metrics adding their subfix to each colname
to identify it later (graphing purposes).
#auxiliar function to compute the metrics at aggregate level.
foo_summary <- function(df ,
metrics =c("time_iter","metric1", "metric2") ,
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
{
#compute the aggregate metrics
out <- as.data.frame(aggregate(
x = df[metrics],
by = df[by],
FUN = summary_function,
na.rm = TRUE))
#rename conviniently the metric computed
oldnames <- colnames(out[metrics])
names(out)[match(oldnames,names(out))] <- paste(colnames(out[metrics]),
as.character(substitute(summary_function)),
sep = "_")
return(out)
}
df_mean <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
df_sd <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = sd)
final_df <- merge(df_mean,df_sd )
# nobs hyperpar1 hyperpar2 modeltype time_iter_mean metric1_mean metric2_mean time_iter_sd metric1_sd metric2_sd
# 1 75 0.5 0.5 tree 0.38395 secs 0.44300 0.25725 0.4609629107 0.05897271 0.0004949747
# 2 75 0.5 1.0 tree 1.49605 secs 0.56925 0.39240 2.0447406792 0.20343462 0.1555634919
# 3 75 0.8 0.5 tree 0.04670 secs 0.35840 0.25675 0.0009899495 0.16560441 0.0594676803
# 4 75 0.8 1.0 tree 0.04030 secs 0.56470 0.41035 0.0009899495 0.15018948 0.0016263456
Upvotes: 1
Views: 402
Reputation: 6234
Using dplyr::bind_rows()
, the nested list is directly unnested as a data.frame after which it is straightforward to calculate the summary statistics:
library(dplyr)
bind_rows(res) %>%
group_by(modeltype, nobs, hyperpar1, hyperpar2) %>%
summarize(across(everything(), list(mean = mean, sd = sd)), .groups = "drop")
#> # A tibble: 4 x 12
#> modeltype nobs hyperpar1 hyperpar2 time_iter_mean time_iter_sd seed_mean
#> <chr> <dbl> <dbl> <dbl> <drtn> <dbl> <dbl>
#> 1 tree 75 0.5 0.5 0.38395 secs 0.461 1.5
#> 2 tree 75 0.5 1 1.49605 secs 2.04 1.5
#> 3 tree 75 0.8 0.5 0.04670 secs 0.000990 1.5
#> 4 tree 75 0.8 1 0.04030 secs 0.000990 1.5
#> # … with 5 more variables: seed_sd <dbl>, metric1_mean <dbl>, metric1_sd <dbl>,
#> # metric2_mean <dbl>, metric2_sd <dbl>
Upvotes: 1
Reputation: 17785
You could try data.table
:
library(data.table)
tmp = data.table(res)
tmp = tmp[, t(res[1]), by=1:nrow(tmp)]
tmp = tmp[, V1[[1]], by=1:nrow(tmp)]
g = function(x) list(mean = mean(x), sd = sd(x))
tmp[, unlist(lapply(.SD, g), recursive=FALSE)
, .SDcols=hyperpar1:metric2,
, by=.(nobs, hyperpar1, hyperpar2, modeltype)]
#> nobs hyperpar1 hyperpar2 modeltype hyperpar1.mean hyperpar1.sd
#> 1: 75 0.5 0.5 tree 0.5 0
#> 2: 75 0.8 0.5 tree 0.8 0
#> 3: 75 0.5 1.0 tree 0.5 0
#> 4: 75 0.8 1.0 tree 0.8 0
#> hyperpar2.mean hyperpar2.sd metric1.mean metric1.sd metric2.mean
#> 1: 0.5 0 0.44300 0.05897271 0.25725
#> 2: 0.5 0 0.35840 0.16560441 0.25675
#> 3: 1.0 0 0.56925 0.20343462 0.39240
#> 4: 1.0 0 0.56470 0.15018948 0.41035
#> metric2.sd
#> 1: 0.0004949747
#> 2: 0.0594676803
#> 3: 0.1555634919
#> 4: 0.0016263456
This code uses successive unnesting of list-columns, a strategy which I have described in this notebook: http://arelbundock.com/posts/datatable_nesting/
Upvotes: 2