Efficient summary statistics of nested huge list in R

I am running a simulation study where my results are stored in a nested list structure. The first level of the list represents the different hyperparameters generated by the model. The second level is the number of replications of the same model (changing the seed).

In the example below, I list the output of a model that is governed by two hyperparameters (hyperpar1 and hyperpar2) where both can take 2 different values, leading to 4 different combinations of the resulting model. Additionally, each of the 4 possible combinations was run twice (different seeds), leading to eight possible combinations (as can be seen below with str(res, max = 2)). Finally, two performance metrics were recovered from each possible iteration of the models (metric1 and metric2).

My problem is that in my real data, the number of iterations (second level of the list) is huge (up to 10000), and the full factorial of the number of hyperparameters is up to 2000 in some cases. Therefore the unlisting process becomes rather slow.

Below I list my current procedure with my desired output, but again, it is relatively slow. In particular, there is one part when I unlist everything, and I put it together in one big data.frame that takes way too long, but I haven't solved this in a faster way.

res <-list(
  list(list(modeltype = "tree", time_iter = structure(0.7099, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4847, metric2 = 0.2576 ),
       list(modeltype = "tree", time_iter = structure(0.058 , class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4013, metric2 = 0.2569 )), 
  list(list(modeltype = "tree", time_iter = structure(0.046 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.4755, metric2 = 0.2988 ), 
       list(modeltype = "tree", time_iter = structure(0.0474, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.2413, metric2 = 0.2147 )), 
  list(list(modeltype = "tree", time_iter = structure(0.0502, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1  , metric1 = 0.7131, metric2 = 0.5024 ), 
       list(modeltype = "tree", time_iter = structure(2.9419, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1  , metric1 = 0.4254, metric2 = 0.2824 )), 
  list(list(modeltype = "tree", time_iter = structure(0.041 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1  , metric1 = 0.6709, metric2 = 0.4092 ), 
       list(modeltype = "tree", time_iter = structure(0.0396, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1  , metric1 = 0.4585, metric2 = 0.4115 )))


hyperpar1   <-   c(0.5 , 0.8 ) 
hyperpar2   <-   c(0.5 , 1   )
expand.grid(hyperpar1 = hyperpar1, hyperpar2 = hyperpar2)

#   hyperpar1 hyperpar2
# 1       0.5       0.5
# 2       0.8       0.5
# 3       0.5       1.0
# 4       0.8       1.0

#List structure:
#The 4 elements represent the 4 combinations of the hyperparams
#Inside each of the 4 combinations of the hyperparams, 2 lists represent the 2 simulations (with different seeds)
str(res, max = 1)
#Finally, inside each of the final level (level=3) there is a list of 10 objects that are the results of each simulation
str(res, max = 2)

# List of 4
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8

#e.g Fist iteration of first model 
t(res[[1]][[1]])
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# [1,] "tree"    0.7099    1    75   0.5       0.5       0.4847  0.2576 


Parsing the list

In the following code, I unnest the list and put everything into a data.frame.

#Unlist the nested structure of the list `res`
all_in_list <- lapply(1:length(res), function(i) {
     unlisting <- unlist(res[i],recursive = FALSE)
     to_df <- do.call(rbind, lapply(unlisting, as.data.frame))
     return(to_df)})

#Here is where averything get really really slow when the list is huge
all_in_df <- do.call(rbind, lapply(all_in_list, as.data.frame))

#   modeltype   time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# 1      tree 0.7099 secs    1   75       0.5       0.5  0.4847  0.2576
# 2      tree 0.0580 secs    2   75       0.5       0.5  0.4013  0.2569
# 3      tree 0.0460 secs    1   75       0.8       0.5  0.4755  0.2988
# 4      tree 0.0474 secs    2   75       0.8       0.5  0.2413  0.2147
# 5      tree 0.0502 secs    1   75       0.5       1.0  0.7131  0.5024
# 6      tree 2.9419 secs    2   75       0.5       1.0  0.4254  0.2824
# 7      tree 0.0410 secs    1   75       0.8       1.0  0.6709  0.4092
# 8      tree 0.0396 secs    2   75       0.8       1.0  0.4585  0.4115

Summary statistics of the output

In the following, I recovered the mean and the standard deviation of the performance metrics adding their subfix to each colname to identify it later (graphing purposes).

#auxiliar function to compute the metrics at aggregate level. 
foo_summary <- function(df ,
                        metrics =c("time_iter","metric1", "metric2") ,
                        by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
                        summary_function =  mean)
{
#compute the aggregate metrics  
out <-  as.data.frame(aggregate(
    x = df[metrics],
    by = df[by],
    FUN = summary_function, 
    na.rm = TRUE))
#rename conviniently the metric computed
oldnames <- colnames(out[metrics])
names(out)[match(oldnames,names(out))] <- paste(colnames(out[metrics]), 
                                                as.character(substitute(summary_function)),
                                                sep = "_")
return(out)
}



df_mean <- foo_summary(df = all_in_df, 
            metrics =c("time_iter","metric1", "metric2"),
            by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
            summary_function =  mean)


df_sd <- foo_summary(df = all_in_df, 
            metrics =c("time_iter","metric1", "metric2"),
            by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
            summary_function =  sd)

final_df <- merge(df_mean,df_sd )

Finally, the desired output.

# nobs hyperpar1 hyperpar2 modeltype time_iter_mean metric1_mean metric2_mean time_iter_sd metric1_sd   metric2_sd
# 1   75       0.5       0.5      tree   0.38395 secs      0.44300      0.25725 0.4609629107 0.05897271 0.0004949747
# 2   75       0.5       1.0      tree   1.49605 secs      0.56925      0.39240 2.0447406792 0.20343462 0.1555634919
# 3   75       0.8       0.5      tree   0.04670 secs      0.35840      0.25675 0.0009899495 0.16560441 0.0594676803
# 4   75       0.8       1.0      tree   0.04030 secs      0.56470      0.41035 0.0009899495 0.15018948 0.0016263456

Upvotes: 1

Views: 402

Answers (2)

Joris C.
Joris C.

Reputation: 6234

Using dplyr::bind_rows(), the nested list is directly unnested as a data.frame after which it is straightforward to calculate the summary statistics:

library(dplyr)

bind_rows(res) %>%
  group_by(modeltype, nobs, hyperpar1, hyperpar2) %>%
  summarize(across(everything(), list(mean = mean, sd = sd)), .groups = "drop")

#> # A tibble: 4 x 12
#>   modeltype  nobs hyperpar1 hyperpar2 time_iter_mean time_iter_sd seed_mean
#>   <chr>     <dbl>     <dbl>     <dbl> <drtn>                <dbl>     <dbl>
#> 1 tree         75       0.5       0.5 0.38395 secs       0.461          1.5
#> 2 tree         75       0.5       1   1.49605 secs       2.04           1.5
#> 3 tree         75       0.8       0.5 0.04670 secs       0.000990       1.5
#> 4 tree         75       0.8       1   0.04030 secs       0.000990       1.5
#> # … with 5 more variables: seed_sd <dbl>, metric1_mean <dbl>, metric1_sd <dbl>,
#> #   metric2_mean <dbl>, metric2_sd <dbl>

Upvotes: 1

Vincent
Vincent

Reputation: 17785

You could try data.table:

library(data.table)

tmp = data.table(res)
tmp = tmp[, t(res[1]), by=1:nrow(tmp)]
tmp = tmp[, V1[[1]], by=1:nrow(tmp)]


g = function(x) list(mean = mean(x), sd = sd(x))
tmp[, unlist(lapply(.SD, g), recursive=FALSE)
    , .SDcols=hyperpar1:metric2,
    , by=.(nobs, hyperpar1, hyperpar2, modeltype)]

#>    nobs hyperpar1 hyperpar2 modeltype hyperpar1.mean hyperpar1.sd
#> 1:   75       0.5       0.5      tree            0.5            0
#> 2:   75       0.8       0.5      tree            0.8            0
#> 3:   75       0.5       1.0      tree            0.5            0
#> 4:   75       0.8       1.0      tree            0.8            0
#>    hyperpar2.mean hyperpar2.sd metric1.mean metric1.sd metric2.mean
#> 1:            0.5            0      0.44300 0.05897271      0.25725
#> 2:            0.5            0      0.35840 0.16560441      0.25675
#> 3:            1.0            0      0.56925 0.20343462      0.39240
#> 4:            1.0            0      0.56470 0.15018948      0.41035
#>      metric2.sd
#> 1: 0.0004949747
#> 2: 0.0594676803
#> 3: 0.1555634919
#> 4: 0.0016263456

This code uses successive unnesting of list-columns, a strategy which I have described in this notebook: http://arelbundock.com/posts/datatable_nesting/

Upvotes: 2

Related Questions