Reputation: 900
I am trying to create new columns grouped by different columns but I am not sure if the way I am doing it is the best way to use group_by. I am wondering if there is a way I can group_by in line?
I know it can be done using data.table package where the syntax is of type DT[i,j, by].
But since this is a small piece in a bigger code which uses tidyverse and works great as is, I just don't want to deviate from that.
## Creating Sample Data Frame
state <- rep(c("OH", "IL", "IN", "PA", "KY"),10)
county <- sample(LETTERS[1:5], 50, replace = T) %>% str_c(state,sep = "-")
customers <- sample.int(50:100,50)
sales <- sample.int(500:5000,50)
df <- bind_cols(data.frame(state, county,customers,sales))
## workflow
df2 <- df %>%
group_by(state) %>%
mutate(customerInState = sum(customers),
saleInState = sum(sales)) %>%
ungroup %>%
group_by(county) %>%
mutate(customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
ungroup %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent)) %>%
ungroup
I want my code to look like
df3 <- df %>%
mutate(customerInState = sum(customers, by = state),
saleInState = sum(sales, by = state),
customerInCounty = sum(customers, by = county),
saleInCounty = sum(sales, by = county),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState,
minSale = min(salePerCountyPercent, by = state))
it runs without errors, but I know the output is not right
I understand that it may be possible to juggle around the mutates to get what I need with less amount of group_bys. But the questions is, if there is away to do in line group by in dplyr
Upvotes: 3
Views: 1533
Reputation: 5138
You could create wrapper to do what you want. This specific solution works if you have one grouping variable. Good luck!
library(tidyverse)
mutate_by <- function(.data, group, ...) {
group_by(.data, !!enquo(group)) %>%
mutate(...) %>%
ungroup
}
df1 <- df %>%
mutate_by(state,
customerInState = sum(customers),
saleInState = sum(sales)) %>%
mutate_by(county,
customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(state,
minSale = min(salePerCountyPercent))
identical(df2, df1)
[1] TRUE
EDIT: or, more concicely / similar to your code:
df %>%
mutate_by(customerInState = sum(customers),
saleInState = sum(sales), group = state) %>%
mutate_by(customerInCounty = sum(customers),
saleInCounty = sum(sales), group = county) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(minSale = min(salePerCountyPercent), group = state)
Upvotes: 5
Reputation: 76402
You can do it in two steps, creating two data sets, then left_join
them.
library(dplyr)
df2 <- df %>%
group_by(state) %>%
summarise(customerInState = sum(customers),
saleInState = sum(sales))
df3 <- df %>%
group_by(state, county) %>%
summarise(customerInCounty = sum(customers),
saleInCounty = sum(sales))
df2 <- left_join(df2, df3) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent))
Final clean up.
rm(df3)
Upvotes: 3
Reputation: 17090
Ah, you mean the syntax style. No, this is not how tidyverse runs, I'm afraid. You want tidyverse, you better use pipes. However: (i) once you grouped something, it stays grouped until you group again with a different column. (ii) No need to ungroup if you group again. We can therefore shorten your code:
df3 <- df %>%
group_by(county) %>%
mutate(customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
group_by(state) %>%
mutate(customerInState = sum(customers),
saleInState = sum(sales),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate(minSale = min(salePerCountyPercent)) %>%
ungroup
Two mutates and two group_by's.
Now: the order of columns is different, but we can easily test that the data is identical:
identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
(iii) I have no idea about the administrative structure of the US, but I assume that counties are nested within states, correct? Then how about using summarize? Do you need to keep all the individual sales, or is it enough to generate per county and/or per state statistics?
Upvotes: 3