Reputation: 641

How to automatically full_join list elements across lists

A simplified version of my data looks like this:

primary_list <- list(secondary_list = list(df1 = data.frame(df_name = "hello", time = c(1,2,3,6,7,9,12,20), v1 = sample(0:1, 8, replace = T)),
                                           df2 = data.frame(df_name = "world", time = c(1,5,6,8,9,12,15,18,19,20), v1 = sample(0:1, 10, replace = T))),
                                           
                     secondary_list2 = list(df1 = data.frame(df_name = "hello", time = c(1,5,6,8,9,12,15,18,19,20), v1 = sample(0:1, 10, replace = T)),
                                            df2 = data.frame(df_name = "world", time = c(4,6,7,8,12), v1 = sample(0:1, 5, replace = T))))

My objective is to full_join all the dataframes in the secondary_lists (i.e., the ones with a common element name and df_name identifier) to a separate dataframe by 'time'. The aim is similar to this question but using a separate dataframe as the argument to full_join.

as a manual example, I'm trying to achieve something like this:

time_df <- data.frame(time = 1:25)

expected_output <- list(
  secondary_list = list(time_df, primary_list$secondary_list$df1, primary_list$secondary_list2$df1) %>% reduce(full_join, by = "time"),
  secondary_list2 = list(time_df, primary_list$secondary_list$df2, primary_list$secondary_list2$df2) %>% reduce(full_join, by = "time")
)

### Ideally I'm aiming for something cleaned up like this:                

ideal_output <- lapply(expected_output, function(x){
  x <- x %>% select(time,v1.x,v1.y)
  colnames(x) <- c("time", "v1", "v2")
  x
})

### I can probably handle that after

I'm trying to make a solution that works with a much larger list structure and computes the full_join for all the secondary_list elements

I've tried flattening the list like this but run into some complications indexing list elements sharing the same name. Also tried rbindlist to squash all the secondary lists into one dataframe but no joy there yet.

Solution does not have to be dplyr, just needs to work for n secondary_lists containing n dataframes of different lengths. For bonous points it would be great if the output does not repeat df_name.x, df_name.y as only one identifier column is necessary followed by the variables (see ideal_output) but I can clean that up afterwards easily enough.

Thanks in advance!

UPDATE I got three great answers to the question.

Ronak's is works great in a tidyverse pipeline workflow
ekoam's is a one liner and I'm all for succinct code
neon_ninja wins on the microbenchmark

Since I have a huge amount of data, speed is important so I'll accept neon_ninja's answer. Thanks all!

mbm = microbenchmark(
#####
ronak = primary_list %>%
  purrr::transpose() %>%
  map(~{
    .x$time_df <- time_df
    .x %>%
      reduce(full_join, by = 'time') %>% 
      select(time, df_name.x, starts_with("v")) ### Edited select fo n dfs
  }),

#####
ekoam = lapply(primary_list, compose(~select(., time, v1 = v1.x, v2 = v1.y), reduce), full_join, .init = time_df, by = "time"),
#####
neon_ninja = lapply(names(primary_list[[1]]), function(df_name) {
  time_df["df_name"] = df_name ### I added this to get a column of names
  time_df = data.frame(time = 1:25)
  for (sec_list_name in names(primary_list)) {
    time_df[sec_list_name] = NA
    df = primary_list[[sec_list_name]][[df_name]]
    for (i in 1:length(df$time)) {
      time_df[time_df$time == df$time[i], sec_list_name] = df$v1[i]
    }
  }
  time_df
})

)

#Unit: milliseconds
#       expr       min        lq      mean    median        uq      max neval
#      ronak 11.572526 12.507490 15.132911 13.491643 16.531374 28.11236   100
#      ekoam 10.710731 11.464450 13.665370 12.450013 14.204986 26.83166   100
# neon_ninja  1.116033  1.258137  1.784428  1.345762  1.643547 13.89803   100

Upvotes: 0

Answers (3)

neon_ninja

Reputation: 61

Here's how I would solve this:

lapply(names(primary_list[[1]]), function(df_name) {
  time_df = data.frame(time = 1:25)
  for (sec_list_name in names(primary_list)) {
    time_df[sec_list_name] = NA
    df = primary_list[[sec_list_name]][[df_name]]
    for (i in 1:length(df$time)) {
      time_df[time_df$time == df$time[i], sec_list_name] = df$v1[i]
    }
  }
  time_df
})

If you need to find out your time range, here's a one liner to do that:

range(unlist(lapply(unlist(primary_list, F, F), function(df) { df$time })))

Similarly if you need to find out df names:

unique(unlist(lapply(primary_list, names)))

Upvotes: 1

ekoam

Reputation: 8844

How about this?

c(list(time_df), unlist(primary_list, False, False)) %>% reduce(full_join, by = "time")

Update

lapply(primary_list, compose(~select(., time, v1 = v1.x, v2 = v1.y), reduce), full_join, .init = time_df, by = "time")

Output

$secondary_list
   time v1 v2
1     1  1  1
2     2  0 NA
3     3  1 NA
4     4 NA NA
5     5 NA  1
6     6  1  0
7     7  1 NA
8     8 NA  1
9     9  0  0
10   10 NA NA
11   11 NA NA
12   12  0  0
13   13 NA NA
14   14 NA NA
15   15 NA  0
16   16 NA NA
17   17 NA NA
18   18 NA  0
19   19 NA  0
20   20  1  1
21   21 NA NA
22   22 NA NA
23   23 NA NA
24   24 NA NA
25   25 NA NA

$secondary_list2
   time v1 v2
1     1  1 NA
2     2 NA NA
3     3 NA NA
4     4 NA  1
5     5  0 NA
6     6  0  1
7     7 NA  1
8     8  1  1
9     9  1 NA
10   10 NA NA
11   11 NA NA
12   12  1  1
13   13 NA NA
14   14 NA NA
15   15  1 NA
16   16 NA NA
17   17 NA NA
18   18  1 NA
19   19  0 NA
20   20  0 NA
21   21 NA NA
22   22 NA NA
23   23 NA NA
24   24 NA NA
25   25 NA NA

Upvotes: 1

Ronak Shah

Reputation: 389055

You can use purrr::transpose to get all the similar named dataframes in one list and then include time_df as a new dataframe in each list and do a full_join.

library(dplyr)
library(purrr)

primary_list %>%
  purrr::transpose() %>%
  map(~{
    .x$time_df <- time_df
    .x %>%
        reduce(full_join, by = 'time') %>% 
        select(time, v1 = v1.x, v2 = v1.y)
    })

Upvotes: 2

How to automatically full_join list elements across lists

Answers (3)

Related Questions