ok1more
ok1more

Reputation: 900

In line group by in dplyr to mutate columns

I am trying to create new columns grouped by different columns but I am not sure if the way I am doing it is the best way to use group_by. I am wondering if there is a way I can group_by in line?

I know it can be done using data.table package where the syntax is of type DT[i,j, by].

But since this is a small piece in a bigger code which uses tidyverse and works great as is, I just don't want to deviate from that.

## Creating Sample Data Frame
state <- rep(c("OH", "IL", "IN", "PA", "KY"),10) 
county <- sample(LETTERS[1:5], 50, replace = T) %>% str_c(state,sep = "-") 
customers <- sample.int(50:100,50) 
sales <- sample.int(500:5000,50)

df <- bind_cols(data.frame(state, county,customers,sales))

## workflow

df2 <- df %>%
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>% 
  ungroup %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>% 
  ungroup %>% 
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

I want my code to look like

df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))

it runs without errors, but I know the output is not right

I understand that it may be possible to juggle around the mutates to get what I need with less amount of group_bys. But the questions is, if there is away to do in line group by in dplyr

Upvotes: 3

Views: 1533

Answers (3)

Andrew
Andrew

Reputation: 5138

You could create wrapper to do what you want. This specific solution works if you have one grouping variable. Good luck!

library(tidyverse)

mutate_by <- function(.data, group, ...) {

  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup

}

df1 <- df %>%
  mutate_by(state, 
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(state,
            minSale = min(salePerCountyPercent))

identical(df2, df1)
[1] TRUE

EDIT: or, more concicely / similar to your code:

df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(minSale = min(salePerCountyPercent), group = state)

Upvotes: 5

Rui Barradas
Rui Barradas

Reputation: 76402

You can do it in two steps, creating two data sets, then left_join them.

library(dplyr)

df2 <- df %>%
  group_by(state) %>% 
  summarise(customerInState = sum(customers),
         saleInState = sum(sales))

df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))

df2 <- left_join(df2, df3) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent))

Final clean up.

rm(df3)

Upvotes: 3

January
January

Reputation: 17090

Ah, you mean the syntax style. No, this is not how tidyverse runs, I'm afraid. You want tidyverse, you better use pipes. However: (i) once you grouped something, it stays grouped until you group again with a different column. (ii) No need to ungroup if you group again. We can therefore shorten your code:

df3 <- df %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers), 
         saleInCounty = sum(sales)) %>% 
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

Two mutates and two group_by's.

Now: the order of columns is different, but we can easily test that the data is identical:

identical((df3 %>% select(colnames(df2))), (df2)) # TRUE

(iii) I have no idea about the administrative structure of the US, but I assume that counties are nested within states, correct? Then how about using summarize? Do you need to keep all the individual sales, or is it enough to generate per county and/or per state statistics?

Upvotes: 3

Related Questions