Reputation: 1331
I apologize for the length but it is necessary in order to not skip over details and make this more confusing than it is already.
Below is the sample data and some of the manipulations that I have done so far.
library(dplyr)
library(tidyverse)
emp <- c(1,2,3,4,5,6,7,8,1,12,54,101,33,159,201,261,110,195,131,228)
small <- c(1,1,1,1,1,1,1,1,1,1,1,2,1,3,3,4,2,3,2,3)
area <-c(003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003,003)
twodigit <-c(11,22,11,22,23,22,11,31,44,45,21,44,45,62,72,22,45,72,45,21)
smbtest2 <- data.frame(emp,small,area,twodigit)
So before I get too far the goal is to sum the employment (emp) by small (schema below) and then break it down by two digit (industry code). In this simple example, I want the top 3 industries per every small category. I am trying to cumsum because if one is in the first category (0 to 99) then it will be in the second category (0 to 149).
smbsummary3<-smbtest2 %>%
group_by(area,small,twodigit) %>%
summarise(emp = sum(emp), worksites = n(),
.groups = 'drop_last')%>%
slice_max(emp,n=3)
smbsummary4<-smbsummary3 %>%
ungroup %>%
complete(area, small = unique(small)) %>%
fill(emp, worksites)
Schema for small
1 0 to 99
2 0 to 149
3 0 to 249
4 0 to 499
Desired Result
area small twodigit emp worksites
003 1 21 54 1
003 1 45 45 2 (12+33)
003 1 22 12 3 (2+4+6)
003 2 45 286 4 (12+33+110+131)
003 2 44 102 2 (1+101)
003 2 21 54 1
At present, it sums based purely on small which is what it should do based on the code. However, my question is how do I get it to cumsum (cumulative sum) based on the small category?
Below is my latest attempt. It does not add up to the correct answer but I think it is close to being the correct set of commands.
smbsummary3<-smbtest2 %>%
group_by(area,small,twodigit) %>%
summarise(emp = sum(emp), worksites = n(),
.groups = 'drop_last')%>%
mutate(emp = cumsum(emp),
worksites = cumsum(worksites))%>%
slice_max(emp,n=3)
Upvotes: 0
Views: 67
Reputation: 30474
I was going to try to explain in a comment, but this seemed easier.
Maybe you want to group_by
just area
and twodigit
before doing your cumulative sum.
Then, group_by
again to select the top 3 emp
values by area
and small
. The resulting output looks very similar (could not find small
2 and twodigit
21 in dataset).
smbtest2 %>%
group_by(area, small, twodigit) %>%
summarise(emp = sum(emp),
worksites = n(),
.groups = 'drop_last') %>%
group_by(area, twodigit) %>%
mutate(emp = cumsum(emp),
worksites = cumsum(worksites)) %>%
group_by(area, small) %>%
slice_max(emp, n = 3) %>%
arrange(area, small, desc(emp))
Output
area small twodigit emp worksites
<dbl> <dbl> <dbl> <dbl> <int>
1 3 1 21 54 1
2 3 1 45 45 2
3 3 1 22 12 3
4 3 2 45 286 4
5 3 2 44 102 2
6 3 3 72 396 2
7 3 3 21 282 2
8 3 3 62 159 1
9 3 4 22 273 4
Upvotes: 1