Calculate total number of distinct users across years, cumulatively

Question

Let's say I have a data.frame like so:

user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)

For each user I have a unique id, industry, and year.

I'm trying to compute a cumulative count of the people who have ever worked Government, so the cumulative count should be a count of the total number of unique users for that year and all preceding years.

I know I can do an ordinary cumulative sum like so:

user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))

   year industry   cum_sum
           
1  1999 Government       3
2  1999 NGO              1
3  1999 Private          1
4  2000 Government       4
5  2001 Government       2
6  2001 Private          3

However, this isn't what I want since the sums in the year 2000 and 2001 will include people who have already been included in 1999. I want each year to be a cumulative count of the total number of unique users that have ever worked in Government at a given year. I couldn't figure out the right way to do this in dplyr.

So the correct output should look like:

   year industry   cum_sum
           
1  1999 Government       3
2  1999 NGO              1
3  1999 Private          1
4  2000 Government       4
5  2001 Government       5
6  2001 Private          3

tmfmnk · Accepted Answer

One option might be:

user_df %>%
 group_by(industry) %>%
 mutate(cum_sum = cumsum(!duplicated(id))) %>%
 group_by(year, industry) %>%
 summarise(cum_sum = max(cum_sum))

   year industry   cum_sum
           
1  1999 Government       3
2  1999 NGO              1
3  1999 Private          1
4  2000 Government       4
5  2001 Government       5
6  2001 Private          3

Calculate total number of distinct users across years, cumulatively

Answers (2)

Related Questions