Reputation: 570
I frequently find myself calculating summary statistics of data frames using the following dplyr syntax:
1. Aggregate <-
2. Original Dataset %>%
3. Group_By %>%
4. Filter %>%
5. Summarize %>%
6. Left_Join(back to Aggregate)
For example:
Original <- data.frame(A = 1:100,B = sample(LETTERS,100,replace = TRUE),C = rnorm(100))
# Calculate 1st Summary Statistic
Aggregate <- Original %>% group_by(B) %>%
filter(A > 50) %>%
summarize(meanC = mean(C))
# Calculate 2nd Summary Statistic
Aggregate <- Original %>% group_by(B) %>%
summarize(Q = sum(C)) %>%
left_join(x = Aggregate,y = Original,by = "B")
My question is two fold:
A) Is there a better way of building a table of summary statistics off another table? The left-join feels very clunky.
B) What is the "data.table" way of doing this, i.e., how do I join back to the Aggregate table?
Aggregate[Aggregate[,meanC:=mean(C),by=.(B)]]
Thanks for any advice...
Upvotes: 5
Views: 2072
Reputation: 591
I believe DanY pointed most of this out in the comments but posting for completeness sake. You don't really need to do any merging w/ data.table
:
# data
Original <- data.frame(A = 1:100, B = sample(LETTERS, 100, replace = TRUE), C = rnorm(100))
# to data.table
dt <- data.table::as.data.table(Original)
# first summary
dt[A > 50, meanC := mean(C)]
# second summary
dt[, Q := sum(C), by = B]
Though if you did want to save off those table, you certainly can:
# to data.table
dt <- data.table::as.data.table(Original)
# first aggreggate
agg_a <- dt[A > 50, .(A, meanC = mean(C))]
# second aggregate
agg_b <- dt[, .(Q = sum(C)), by = B]
# merge
dt <- data.table::merge.data.table(dt, agg_a, by = "A", all.x = TRUE)
dt <- data.table::merge.data.table(dt, agg_b, by = "B", all.x = TRUE)
Upvotes: 0
Reputation: 66415
You can avoid the join if you mutate after the group_by instead of summarize.
library(tidyverse)
Original <- data.frame(A = 1:100,B = sample(LETTERS,100,replace = TRUE),C = rnorm(100))
# Calculate unfiltered summary statistic, as in OP
Aggregate_OP <- Original %>%
group_by(B) %>%
summarize(meanC = mean(C)) %>%
right_join(Original) %>%
select(A, B, C, meanC) # reorder columns
#> Joining, by = "B"
# Simpler, using mutate
Aggregate_mutate <- Original %>%
group_by(B) %>%
mutate(meanC = mean(C)) %>%
ungroup()
# with dplyr 1.1.0+, we can do:
Aggregate_mutate <- Original %>%
mutate(meanC = mean(C), .by = B)
identical(Aggregate_OP |> arrange(A), tibble(Aggregate_mutate))
#> [1] TRUE
UPDATE -- to calculate a filtered stat like in the OP:
Original %>%
mutate(meanC = mean(C[A > 50]), .by = B)
Upvotes: 4