Reputation: 641
A simplified version of my data looks like this:
primary_list <- list(
  secondary_list = list(
    df1 = data.frame(df_name = "hello", time = c(1, 2, 3, 6, 7, 9, 12, 20), v1 = sample(0:1, 8, replace = TRUE)),
    df2 = data.frame(df_name = "world", time = c(1, 5, 6, 8, 9, 12, 15, 18, 19, 20), v1 = sample(0:1, 10, replace = TRUE))
  ),
  secondary_list2 = list(
    df1 = data.frame(df_name = "hello", time = c(1, 5, 6, 8, 9, 12, 15, 18, 19, 20), v1 = sample(0:1, 10, replace = TRUE)),
    df2 = data.frame(df_name = "world", time = c(4, 6, 7, 8, 12), v1 = sample(0:1, 5, replace = TRUE))
  )
)
My objective is to full_join all the dataframes in the secondary_lists (i.e., the ones with a common element name and df_name identifier) to a separate dataframe by 'time'. The aim is similar to this question, but using a separate dataframe as the argument to full_join.
As a manual example, I'm trying to achieve something like this:
library(dplyr)
library(purrr)

time_df <- data.frame(time = 1:25)
expected_output <- list(
  secondary_list  = list(time_df, primary_list$secondary_list$df1, primary_list$secondary_list2$df1) %>% reduce(full_join, by = "time"),
  secondary_list2 = list(time_df, primary_list$secondary_list$df2, primary_list$secondary_list2$df2) %>% reduce(full_join, by = "time")
)
### Ideally I'm aiming for something cleaned up like this:
ideal_output <- lapply(expected_output, function(x){
  x <- x %>% select(time, v1.x, v1.y)
  colnames(x) <- c("time", "v1", "v2")
  x
})
### I can probably handle that after
I'm trying to make a solution that works with a much larger list structure and computes the full_join for all the secondary_list elements. I've tried flattening the list (roughly as sketched below) but ran into complications indexing list elements that share the same name. I also tried rbindlist to squash all the secondary lists into one dataframe, but no joy there yet.
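For illustration only (not my exact attempt), flattening one level with unlist() is roughly what I mean; the shared df1/df2 part ends up buried inside compound names, which is where the indexing gets awkward:
# one level of flattening gives a list of four dataframes with compound names
flat <- unlist(primary_list, recursive = FALSE)
names(flat)
#> [1] "secondary_list.df1"  "secondary_list.df2"  "secondary_list2.df1" "secondary_list2.df2"

# grouping the ones that share a df name then needs string handling, e.g.
split(flat, sub(".*\\.", "", names(flat)))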
The solution does not have to be dplyr; it just needs to work for n secondary_lists containing n dataframes of different lengths. For bonus points, it would be great if the output did not repeat df_name.x, df_name.y, since only one identifier column is necessary, followed by the variables (see ideal_output), but I can clean that up afterwards easily enough.
Thanks in advance!
UPDATE: I got three great answers to the question.
Ronak's works great in a tidyverse pipeline workflow.
ekoam's is a one-liner and I'm all for succinct code.
neon_ninja's wins on the microbenchmark.
Since I have a huge amount of data, speed is important, so I'll accept neon_ninja's answer. Thanks all!
library(microbenchmark)

mbm = microbenchmark(
  #####
  ronak = primary_list %>%
    purrr::transpose() %>%
    map(~{
      .x$time_df <- time_df
      .x %>%
        reduce(full_join, by = 'time') %>%
        select(time, df_name.x, starts_with("v")) ### Edited select for n dfs
    }),
  #####
  ekoam = lapply(primary_list, compose(~select(., time, v1 = v1.x, v2 = v1.y), reduce), full_join, .init = time_df, by = "time"),
  #####
  neon_ninja = lapply(names(primary_list[[1]]), function(df_name) {
    time_df = data.frame(time = 1:25)
    time_df["df_name"] = df_name ### I added this to get a column of names
    for (sec_list_name in names(primary_list)) {
      time_df[sec_list_name] = NA
      df = primary_list[[sec_list_name]][[df_name]]
      for (i in 1:length(df$time)) {
        time_df[time_df$time == df$time[i], sec_list_name] = df$v1[i]
      }
    }
    time_df
  })
)
# Unit: milliseconds
#       expr       min        lq      mean    median        uq      max neval
#      ronak 11.572526 12.507490 15.132911 13.491643 16.531374 28.11236   100
#      ekoam 10.710731 11.464450 13.665370 12.450013 14.204986 26.83166   100
# neon_ninja  1.116033  1.258137  1.784428  1.345762  1.643547 13.89803   100
Upvotes: 0
Views: 347
Reputation: 61
Here's how I would solve this:
lapply(names(primary_list[[1]]), function(df_name) {
  time_df = data.frame(time = 1:25)
  for (sec_list_name in names(primary_list)) {
    # one column per secondary list, NA where no matching time exists
    time_df[sec_list_name] = NA
    df = primary_list[[sec_list_name]][[df_name]]
    for (i in 1:length(df$time)) {
      time_df[time_df$time == df$time[i], sec_list_name] = df$v1[i]
    }
  }
  time_df
})
If you need to find out your time range, here's a one-liner to do that:
range(unlist(lapply(unlist(primary_list, FALSE, FALSE), function(df) { df$time })))
Similarly, if you need to find out the df names:
unique(unlist(lapply(primary_list, names)))
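Putting those two helpers together, you could build time_df from the data instead of hard-coding time = 1:25 (a sketch, assuming a contiguous integer time axis is what you want):
# derive the time axis from the observed data
time_range <- range(unlist(lapply(unlist(primary_list, FALSE, FALSE), function(df) df$time)))
time_df <- data.frame(time = seq(time_range[1], time_range[2]))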
Upvotes: 1
Reputation: 8844
How about this?
c(list(time_df), unlist(primary_list, FALSE, FALSE)) %>% reduce(full_join, by = "time")
Update
lapply(primary_list, compose(~select(., time, v1 = v1.x, v2 = v1.y), reduce), full_join, .init = time_df, by = "time")
Output
$secondary_list
time v1 v2
1 1 1 1
2 2 0 NA
3 3 1 NA
4 4 NA NA
5 5 NA 1
6 6 1 0
7 7 1 NA
8 8 NA 1
9 9 0 0
10 10 NA NA
11 11 NA NA
12 12 0 0
13 13 NA NA
14 14 NA NA
15 15 NA 0
16 16 NA NA
17 17 NA NA
18 18 NA 0
19 19 NA 0
20 20 1 1
21 21 NA NA
22 22 NA NA
23 23 NA NA
24 24 NA NA
25 25 NA NA
$secondary_list2
time v1 v2
1 1 1 NA
2 2 NA NA
3 3 NA NA
4 4 NA 1
5 5 0 NA
6 6 0 1
7 7 NA 1
8 8 1 1
9 9 1 NA
10 10 NA NA
11 11 NA NA
12 12 1 1
13 13 NA NA
14 14 NA NA
15 15 1 NA
16 16 NA NA
17 17 NA NA
18 18 1 NA
19 19 0 NA
20 20 0 NA
21 21 NA NA
22 22 NA NA
23 23 NA NA
24 24 NA NA
25 25 NA NA
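If the compose() call reads as dense: compose() applies its functions right to left by default, so the update above is roughly equivalent to this more explicit version (same output):
# reduce() runs first (seeded with time_df), then select() renames the columns
lapply(primary_list, function(x) {
  x %>%
    reduce(full_join, .init = time_df, by = "time") %>%
    select(time, v1 = v1.x, v2 = v1.y)
})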
Upvotes: 1
Reputation: 389055
You can use purrr::transpose() to get all the similarly named dataframes into one list, then include time_df as a new dataframe in each list and do a full_join.
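To see what that transpose step does to the structure (just an illustrative check):
library(purrr)

# transpose() flips the two levels of nesting: the outer names become the
# dataframe names, and each element collects that dataframe from every
# secondary list.
names(purrr::transpose(primary_list))
#> [1] "df1" "df2"
names(purrr::transpose(primary_list)$df1)
#> [1] "secondary_list"  "secondary_list2"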
library(dplyr)
library(purrr)
primary_list %>%
  purrr::transpose() %>%
  map(~{
    .x$time_df <- time_df
    .x %>%
      reduce(full_join, by = 'time') %>%
      select(time, v1 = v1.x, v2 = v1.y)
  })
Upvotes: 2