Reputation: 16856
I want to combine/reduce a list of dataframes into one dataframe, but I also want to summarize the data in the same step. The output is from a simulation, so each dataframe has the same structure (i.e., a Group column followed by two value columns whose values vary between outputs).
Minimal Reproducible Example
df_list <- list(structure(list(Group = c("A", "B", "C"), Top_Group = c(1L,
0L, 0L), Efficiency = c(0.464688158128411, 0.652386676520109,
0.282913417555392)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(Group = c("A", "B", "C"
), Top_Group = c(0L, 1L, 0L), Efficiency = c(0.120292583014816,
0.0356206290889531, 0.37196880299598)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame")), structure(list(
Group = c("A", "B", "C"), Top_Group = c(0L, 1L, 0L), Efficiency = c(0.261322160949931,
0.383351784432307, 0.754808459430933)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")))
What I Have Tried
I know I could bind the data together, then group and summarize.
library(tidyverse)
df_list %>%
  bind_rows() %>%
  group_by(Group) %>%
  summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency))
# Group Top_Group Efficiency
# <chr> <int> <dbl>
#1 A 1 0.465
#2 B 2 0.652
#3 C 0 0.755
I was hoping that there was some way to use something like reduce; however, I can only get it to work for pulling out one column (like Top_Group shown here), and I am unsure how to use it across all columns (if possible) and return a dataframe instead of vectors.
df_list %>%
  map(2) %>%
  reduce(`+`)
# [1] 1 2 0
Expected Output
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
Upvotes: 4
Views: 388
Reputation: 5788
Another base R option, a few months late:
subset(
  within(
    do.call(rbind, df_list),
    {
      Top_Group <- ave(Top_Group, Group, FUN = sum)
      Efficiency <- ave(Efficiency, Group, FUN = max)
    }
  ),
  !(duplicated(Group))
)
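For readers unfamiliar with ave, here is a minimal sketch of why the final duplicated() filter is needed. The data frame name stacked and the column Group_Max are illustrative only, and the values are rounded from the first two data frames in the question. ave returns one value per input row (the group-wise result repeated), so the stacked data still has one row per simulation until the duplicates are dropped.

```r
# Toy data: values rounded from the first two data frames in the question
stacked <- data.frame(
  Group      = c("A", "B", "C", "A", "B", "C"),
  Efficiency = c(0.465, 0.652, 0.283, 0.120, 0.036, 0.372)
)

# ave() keeps the input length: each row receives its group's maximum
stacked$Group_Max <- ave(stacked$Efficiency, stacked$Group, FUN = max)
print(stacked)

# Keeping only the first row of each group leaves one summary row per group
summary_rows <- stacked[!duplicated(stacked$Group), c("Group", "Group_Max")]
print(summary_rows)
```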
Upvotes: 1
Reputation: 16856
Another option is using data.table, where we can use rbindlist, then summarize the columns.
library(data.table)
rbindlist(df_list)[, list(Top_Group = sum(Top_Group),
                          Efficiency = max(Efficiency)), by = .(Group)]
Output
Group Top_Group Efficiency
1: A 1 0.4646882
2: B 2 0.6523867
3: C 0 0.7548085
Benchmark
Just out of curiosity (as this question is not about efficiency), I also ran all the current answers to see which is fastest. The base R options are fast, but apparently the data.table option is the fastest.
Code
# Packages needed by the expressions being benchmarked
library(tidyverse)
library(data.table)

microbenchmark::microbenchmark(
  akrun = reduce(df_list, ~ tibble(.x[1], .x[2] + .y[2], pmax(.x[3], .y[3]))),
  AllanCameron = Reduce(function(a, b) cbind(a[1], a[2] + b[2], pmax(a[3], b[3])), df_list),
  ThomasIsCoding_agg_ave = {
    aggregate(
      . ~ Group,
      transform(
        do.call(rbind, df_list),
        Efficiency = ave(Efficiency, Group, FUN = function(x) max(x) / length(x))
      ),
      sum
    )
  },
  ThomasIsCoding_agg_sapply = {
    transform(
      aggregate(. ~ Group, do.call(rbind, df_list), list),
      Top_Group = sapply(Top_Group, sum),
      Efficiency = sapply(Efficiency, max)
    )
  },
  deschen = df_list %>%
    reduce(full_join, by = "Group") %>%
    rowwise() %>%
    summarize(Group = Group,
              Top_Group = sum(c_across(starts_with("Top_Group"))),
              Efficiency = max(c_across(starts_with("Efficiency")))) %>%
    ungroup(),
  TomHoel = df_list %>%
    tibble() %>%
    unnest(cols = c(.)) %>%
    group_by(Group) %>%
    summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency)),
  AndrewGB_tidyverse = df_list %>%
    bind_rows() %>%
    group_by(Group) %>%
    summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency)),
  AndrewGB_datatable = rbindlist(df_list)[, list(Top_Group = sum(Top_Group), Efficiency = max(Efficiency)), by = .(Group)],
  times = 2000
)
Upvotes: 2
Reputation: 101343
A base R option using aggregate + ave
aggregate(
  . ~ Group,
  transform(
    do.call(rbind, df_list),
    Efficiency = ave(
      Efficiency,
      Group,
      FUN = function(x) max(x) / length(x)
    )
  ),
  sum
)
or aggregate + sapply
transform(
  aggregate(. ~ Group, do.call(rbind, df_list), list),
  Top_Group = sapply(Top_Group, sum),
  Efficiency = sapply(Efficiency, max)
)
gives
Group Top_Group Efficiency
1 A 1 0.4646882
2 B 2 0.6523867
3 C 0 0.7548085
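The max(x) / length(x) step in the first variant may look odd, so here is a small worked sketch of the idea. The vector eff is illustrative, holding group A's Efficiency values rounded from the question. Every row in a group is replaced by the group maximum divided by the group size, so that the later sum over the group adds up to the maximum again. Because this relies on floating-point division, the result should be compared with all.equal rather than with exact equality.

```r
# Group A's Efficiency values, rounded from the question's data
eff <- c(0.465, 0.120, 0.261)

# What ave(..., FUN = function(x) max(x) / length(x)) assigns to each row
scaled <- rep(max(eff) / length(eff), length(eff))

# Summing the scaled rows recovers the group maximum (up to rounding error)
recovered <- sum(scaled)
print(recovered)
print(all.equal(recovered, max(eff)))
```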
Upvotes: 3
Reputation: 173813
In base R you could just do:
Reduce(function(a, b) cbind(a[1], a[2] + b[2], pmax(a[3], b[3])), df_list)
#> Group Top_Group Efficiency
#> 1 A 1 0.4646882
#> 2 B 2 0.6523867
#> 3 C 0 0.7548085
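One caveat worth noting: a positional Reduce like this combines rows by position, so it assumes every data frame lists the groups in the same order. The sketch below is a variant under that concern, not the code above; df_list2, combine_two, and sorted are hypothetical names, and the values are rounded from the question. It sorts each data frame by Group first and stops if the Group columns still disagree.

```r
# Two toy data frames; the second one lists the groups in a different order
df_list2 <- list(
  data.frame(Group = c("A", "B", "C"), Top_Group = c(1L, 0L, 0L),
             Efficiency = c(0.465, 0.652, 0.283)),
  data.frame(Group = c("C", "A", "B"), Top_Group = c(0L, 0L, 1L),
             Efficiency = c(0.372, 0.120, 0.036))
)

# Combine two data frames row by row, refusing to run if the groups differ
combine_two <- function(a, b) {
  stopifnot(identical(a$Group, b$Group))
  data.frame(
    Group      = a$Group,
    Top_Group  = a$Top_Group + b$Top_Group,
    Efficiency = pmax(a$Efficiency, b$Efficiency)
  )
}

# Sort every data frame by Group so that rows line up by position
sorted <- lapply(df_list2, function(d) d[order(d$Group), ])
res <- Reduce(combine_two, sorted)
print(res)
```

If the data frames might contain different sets of groups, binding the rows and summarizing by group is the safer route.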
Upvotes: 4
Reputation: 6563
You almost had it! Check out ?unnest()
require(tidyverse)
df_list %>%
  tibble() %>%
  unnest(cols = c(.)) %>%
  group_by(Group) %>%
  summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency))
# A tibble: 3 x 3
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
Upvotes: 1
Reputation: 10996
Yet another solution with reduce, full_join, and then a rowwise summarize:
library(tidyverse)
df_list %>%
  reduce(full_join, by = "Group") %>%
  rowwise() %>%
  summarize(Group = Group,
            Top_Group = sum(c_across(starts_with("Top_Group"))),
            Efficiency = max(c_across(starts_with("Efficiency")))) %>%
  ungroup()
# A tibble: 3 x 3
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
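To see why starts_with picks up every copy of a column, it may help to look at the names produced by the chained joins. This is a minimal sketch with a hypothetical df_list_demo whose values are rounded from the question. The first full_join disambiguates the clashing columns with the default .x and .y suffixes, and the third data frame's columns then no longer clash, so they keep their plain names.

```r
library(dplyr)
library(purrr)

# Three toy tibbles with the same structure as in the question (rounded values)
df_list_demo <- list(
  tibble(Group = c("A", "B", "C"), Top_Group = c(1L, 0L, 0L),
         Efficiency = c(0.465, 0.652, 0.283)),
  tibble(Group = c("A", "B", "C"), Top_Group = c(0L, 1L, 0L),
         Efficiency = c(0.120, 0.036, 0.372)),
  tibble(Group = c("A", "B", "C"), Top_Group = c(0L, 1L, 0L),
         Efficiency = c(0.261, 0.383, 0.755))
)

# Join all tibbles on Group and inspect the resulting column names
joined <- reduce(df_list_demo, full_join, by = "Group")
joined_names <- names(joined)
print(joined_names)
# "Group" "Top_Group.x" "Efficiency.x" "Top_Group.y" "Efficiency.y"
# "Top_Group" "Efficiency"
```

With more than three data frames the suffixes repeat and stack up, so the column names get harder to read, although the starts_with selection still matches them.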
Upvotes: 3
Reputation: 887118
Based on the OP's code, different functions are applied to different columns, so we may have to apply those elementwise functions to each column individually:
library(purrr)
reduce(df_list, ~ tibble(.x[1], .x[2] + .y[2], pmax(.x[3], .y[3])))
Output
# A tibble: 3 × 3
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
Upvotes: 2