shwan
shwan

Reputation: 570

r dplyr and data.table: aggregate then join back to original table

I frequently find myself calculating summary statistics of data frames using the following dplyr syntax:

1. Aggregate <- 
2. Original Dataset %>%
3. Group_By %>%
4. Filter %>% 
5. Summarize %>%
6. Left_Join(back to Aggregate)

For example:

Original <- data.frame(A = 1:100,B = sample(LETTERS,100,replace = TRUE),C = rnorm(100))

# Calculate 1st Summary Statistic
Aggregate <- Original %>% group_by(B) %>% 
             filter(A > 50) %>% 
             summarize(meanC = mean(C))

# Calculate 2nd Summary Statistic
Aggregate <- Original %>% group_by(B) %>% 
             summarize(Q = sum(C)) %>%
             left_join(x = Aggregate,y = Original,by = "B")

My question is two fold:

A) Is there a better way of building a table of summary statistics off another table? The left-join feels very clunky.

B) What is the "data.table" way of doing this, i.e., how do I join back to the Aggregate table?

Aggregate[Aggregate[,meanC:=mean(C),by=.(B)]]

Thanks for any advice...

Upvotes: 5

Views: 2072

Answers (2)

sonshine
sonshine

Reputation: 591

I believe DanY pointed most of this out in the comments but posting for completeness sake. You don't really need to do any merging w/ data.table:

# data
Original <- data.frame(A = 1:100, B = sample(LETTERS, 100, replace = TRUE), C = rnorm(100))

# to data.table
dt <- data.table::as.data.table(Original)

# first summary
dt[A > 50, meanC := mean(C)]

# second summary
dt[, Q := sum(C), by = B]

Though if you did want to save off those table, you certainly can:

# to data.table
dt <- data.table::as.data.table(Original)

# first aggreggate
agg_a <- dt[A > 50, .(A, meanC = mean(C))]

# second aggregate
agg_b <- dt[, .(Q = sum(C)), by = B]

# merge
dt <- data.table::merge.data.table(dt, agg_a, by = "A", all.x = TRUE)
dt <- data.table::merge.data.table(dt, agg_b, by = "B", all.x = TRUE)

Upvotes: 0

Jon Spring
Jon Spring

Reputation: 66415

You can avoid the join if you mutate after the group_by instead of summarize.

library(tidyverse)
Original <- data.frame(A = 1:100,B = sample(LETTERS,100,replace = TRUE),C = rnorm(100))

# Calculate unfiltered summary statistic, as in OP
Aggregate_OP <- Original %>% 
  group_by(B) %>% 
  summarize(meanC = mean(C)) %>% 
  right_join(Original) %>% 
  select(A, B, C, meanC) # reorder columns
#> Joining, by = "B"

# Simpler, using mutate
Aggregate_mutate <- Original %>% 
  group_by(B) %>% 
  mutate(meanC = mean(C)) %>% 
  ungroup()

# with dplyr 1.1.0+, we can do:
Aggregate_mutate <- Original %>% 
  mutate(meanC = mean(C), .by = B)


identical(Aggregate_OP |> arrange(A), tibble(Aggregate_mutate))
#> [1] TRUE

UPDATE -- to calculate a filtered stat like in the OP:

Original %>% 
  mutate(meanC = mean(C[A > 50]), .by = B)

Upvotes: 4

Related Questions