Tim Wilcox
Tim Wilcox

Reputation: 1331

How to cumsum in R based on certain fields?

I apologize for the length but it is necessary in order to not skip over details and make this more confusing than it is already.

Below is the sample data and some of the manipulations that I have done so far.

library(dplyr)
library(tidyverse)

emp <- c(1,2,3,4,5,6,7,8,1,12,54,101,33,159,201,261,110,195,131,228)
small <- c(1,1,1,1,1,1,1,1,1,1,1,2,1,3,3,4,2,3,2,3)
area <-c(003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003)
twodigit <-c(11,22,11,22,23,22,11,31,44,45,21,44,45,62,72,22,45,72,45,21)

smbtest2 <- data.frame(emp,small,area,twodigit)

So before I get too far the goal is to sum the employment (emp) by small (schema below) and then break it down by two digit (industry code). In this simple example, I want the top 3 industries per every small category. I am trying to cumsum because if one is in the first category (0 to 99) then it will be in the second category (0 to 149).

smbsummary3<-smbtest2 %>% 
group_by(area,small,twodigit) %>%
summarise(emp = sum(emp), worksites = n(), 
        .groups = 'drop_last')%>%
slice_max(emp,n=3)

smbsummary4<-smbsummary3 %>% 
ungroup %>% 
complete(area, small = unique(small)) %>% 
fill(emp, worksites)

 Schema for small
 1     0 to 99
 2     0 to 149
 3     0 to 249
 4     0 to 499

Desired Result

   area     small   twodigit    emp    worksites
   003        1        21        54        1
   003        1        45        45        2       (12+33)
   003        1        22        12        3       (2+4+6)
   003        2        45       286        4       (12+33+110+131)
   003        2        44       102        2       (1+101)
   003        2        21        54        1

At present, it sums based purely on small which is what it should do based on the code. However, my question is how do I get it to cumsum (cumulative sum) based on the small category?

Below is my latest attempt. It does not add up to the correct answer but I think it is close to being the correct set of commands.

smbsummary3<-smbtest2 %>% 
group_by(area,small,twodigit) %>%
summarise(emp = sum(emp), worksites = n(), 
        .groups = 'drop_last')%>%
mutate(emp = cumsum(emp),
     worksites = cumsum(worksites))%>%
slice_max(emp,n=3)

Upvotes: 0

Views: 67

Answers (1)

Ben
Ben

Reputation: 30474

I was going to try to explain in a comment, but this seemed easier.

Maybe you want to group_by just area and twodigit before doing your cumulative sum.

Then, group_by again to select the top 3 emp values by area and small. The resulting output looks very similar (could not find small 2 and twodigit 21 in dataset).

smbtest2 %>%
  group_by(area, small, twodigit) %>%
  summarise(emp = sum(emp), 
            worksites = n(), 
            .groups = 'drop_last') %>%
  group_by(area, twodigit) %>%
  mutate(emp = cumsum(emp),
         worksites = cumsum(worksites)) %>%
  group_by(area, small) %>%
  slice_max(emp, n = 3) %>%
  arrange(area, small, desc(emp))

Output

   area small twodigit   emp worksites
  <dbl> <dbl>    <dbl> <dbl>     <int>
1     3     1       21    54         1
2     3     1       45    45         2
3     3     1       22    12         3
4     3     2       45   286         4
5     3     2       44   102         2
6     3     3       72   396         2
7     3     3       21   282         2
8     3     3       62   159         1
9     3     4       22   273         4

Upvotes: 1

Related Questions