How to calculate new column depending on aggregate function on group using dplyr (add summary statistics on the summary statistics)?

Question

Quite often I need to calculate a new column for an R dataframe (in long form), whose value should depend on an aggregate function (e.g. sum) of a group. For instance, I might want to know what fraction of sales a product accounts for on any given day:

daily fraction = revenue for product i on day d / sum or revenue for all products on day d

My current strategy is to summarise and join:

library(dplyr)

join_summary <- function(data, ...) left_join(data, summarise(data, ...))

data = data.frame(
  day = c(1,1,2,2,3,3),
  product = rep(c("A", "B"), 3),
  revenue = c(2, 4, 8, 7, 9, 2)
)

data2 <- data %>%
  group_by(day) %>%
  join_summary(daily_revenue = sum(revenue)) %>%
  mutate(revenue_fraction = revenue / daily_revenue)

This works, but I am not sure if it is an anti-pattern. It seems a bit inefficient to repeat the same data (the daily revenue) on multiple rows, littering my data frame with aggregates. My questions are:

Is my current way of doing it OK?
Is there any better way of doing it, preferably using tools from dplyr or the wider Hadleyverse?
Do I really need my custom function join_summary, or can it be done with the existing dplyr verbs? (I prefer to stay within the "pipe flow", though.)

Frank · Accepted Answer

Besides using summarise to aggregate, you can use mutate to assign summary stats to the full column:

data %>% 
  group_by(day) %>% 
  mutate(
    daily_revenue = sum(revenue), 
    revenue_fraction = revenue / daily_revenue
  )

which gives

Source: local data frame [6 x 5]
Groups: day [3]

    day product revenue daily_revenue revenue_fraction
                            
1     1       A       2             6        0.3333333
2     1       B       4             6        0.6666667
3     2       A       8            15        0.5333333
4     2       B       7            15        0.4666667
5     3       A       9            11        0.8181818
6     3       B       2            11        0.1818182

This works because the value sum(revenue) is recycled to fill in all rows in the group.

How to calculate new column depending on aggregate function on group using dplyr (add summary statistics on the summary statistics)?

Answers (1)

Related Questions