Anders
Anders

Reputation: 8588

How to calculate new column depending on aggregate function on group using dplyr (add summary statistics on the summary statistics)?

Quite often I need to calculate a new column for an R dataframe (in long form), whose value should depend on an aggregate function (e.g. sum) of a group. For instance, I might want to know what fraction of sales a product accounts for on any given day:

daily fraction = revenue for product i on day d / sum or revenue for all products on day d

My current strategy is to summarise and join:

library(dplyr)

join_summary <- function(data, ...) left_join(data, summarise(data, ...))

data = data.frame(
  day = c(1,1,2,2,3,3),
  product = rep(c("A", "B"), 3),
  revenue = c(2, 4, 8, 7, 9, 2)
)

data2 <- data %>%
  group_by(day) %>%
  join_summary(daily_revenue = sum(revenue)) %>%
  mutate(revenue_fraction = revenue / daily_revenue)

This works, but I am not sure if it is an anti-pattern. It seems a bit inefficient to repeat the same data (the daily revenue) on multiple rows, littering my data frame with aggregates. My questions are:

Upvotes: 6

Views: 4710

Answers (1)

Frank
Frank

Reputation: 66819

Besides using summarise to aggregate, you can use mutate to assign summary stats to the full column:

data %>% 
  group_by(day) %>% 
  mutate(
    daily_revenue = sum(revenue), 
    revenue_fraction = revenue / daily_revenue
  )

which gives

Source: local data frame [6 x 5]
Groups: day [3]

    day product revenue daily_revenue revenue_fraction
  <dbl>  <fctr>   <dbl>         <dbl>            <dbl>
1     1       A       2             6        0.3333333
2     1       B       4             6        0.6666667
3     2       A       8            15        0.5333333
4     2       B       7            15        0.4666667
5     3       A       9            11        0.8181818
6     3       B       2            11        0.1818182

This works because the value sum(revenue) is recycled to fill in all rows in the group.

Upvotes: 9

Related Questions