Reputation: 8588
Quite often I need to calculate a new column for an R dataframe (in long form), whose value should depend on an aggregate function (e.g. sum) of a group. For instance, I might want to know what fraction of sales a product accounts for on any given day:
daily fraction = revenue for product i on day d / sum or revenue for all products on day d
My current strategy is to summarise and join:
library(dplyr)
join_summary <- function(data, ...) left_join(data, summarise(data, ...))
data = data.frame(
day = c(1,1,2,2,3,3),
product = rep(c("A", "B"), 3),
revenue = c(2, 4, 8, 7, 9, 2)
)
data2 <- data %>%
group_by(day) %>%
join_summary(daily_revenue = sum(revenue)) %>%
mutate(revenue_fraction = revenue / daily_revenue)
This works, but I am not sure if it is an anti-pattern. It seems a bit inefficient to repeat the same data (the daily revenue) on multiple rows, littering my data frame with aggregates. My questions are:
dplyr
or the wider Hadleyverse?join_summary
, or can it be done with the existing dplyr
verbs? (I prefer to stay within the "pipe flow", though.)Upvotes: 6
Views: 4710
Reputation: 66819
Besides using summarise
to aggregate, you can use mutate
to assign summary stats to the full column:
data %>%
group_by(day) %>%
mutate(
daily_revenue = sum(revenue),
revenue_fraction = revenue / daily_revenue
)
which gives
Source: local data frame [6 x 5]
Groups: day [3]
day product revenue daily_revenue revenue_fraction
<dbl> <fctr> <dbl> <dbl> <dbl>
1 1 A 2 6 0.3333333
2 1 B 4 6 0.6666667
3 2 A 8 15 0.5333333
4 2 B 7 15 0.4666667
5 3 A 9 11 0.8181818
6 3 B 2 11 0.1818182
This works because the value sum(revenue)
is recycled to fill in all rows in the group.
Upvotes: 9