Parseltongue
Parseltongue

Reputation: 11697

Calculate total number of distinct users across years, cumulatively

Let's say I have a data.frame like so:

user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)

For each user I have a unique id, industry, and year.

I'm trying to compute a cumulative count of the people who have ever worked Government, so the cumulative count should be a count of the total number of unique users for that year and all preceding years.

I know I can do an ordinary cumulative sum like so:

user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))

   year industry   cum_sum
  <int> <chr>        <int>
1  1999 Government       3
2  1999 NGO              1
3  1999 Private          1
4  2000 Government       4
5  2001 Government       2
6  2001 Private          3

However, this isn't what I want since the sums in the year 2000 and 2001 will include people who have already been included in 1999. I want each year to be a cumulative count of the total number of unique users that have ever worked in Government at a given year. I couldn't figure out the right way to do this in dplyr.

So the correct output should look like:

   year industry   cum_sum
  <int> <chr>        <int>
1  1999 Government       3
2  1999 NGO              1
3  1999 Private          1
4  2000 Government       4
5  2001 Government       5
6  2001 Private          3

Upvotes: 2

Views: 105

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269852

1) sqldf This can be implemented through a complex self-join in sql. This joins each row to the rows having the same industry and same year or before and then groups them by year and industry counting the distinct id's.

library(sqldf)
sqldf("select a.year, a.industry, count(distinct b.id) cum_sum
  from user_df a
  left join user_df b on b.industry = a.industry and b.year <= a.year
  group by a.year, a.industry")

giving:

  year   industry cum_sum
1 1999 Government       3
2 1999        NGO       1
3 1999    Private       1
4 2000 Government       4
5 2001 Government       5
6 2001    Private       3

2) baseA base only solution is formed by merging the data frame to itself on industry and then subset to the same or earlier year and aggregate over industry and year. This is inefficient since unlike the SQL statement which filters as it joins this creates the entire join before filtering it down; however, if your data is not too large this may be sufficient.

m <- merge(user_df, user_df, by = "indstry")
s <- subset(m, year.y <= year.x)
ag <- aggregate(id.y ~ industry + year.x, s, function(x) length(unique(x)))
names(ag) <- sub("\\..*", "", names(ag))
ag

giving:

    industry year id
1 Government 1999  3
2        NGO 1999  1
3    Private 1999  1
4 Government 2000  4
5 Government 2001  5
6    Private 2001  3

Upvotes: 2

tmfmnk
tmfmnk

Reputation: 40051

One option might be:

user_df %>%
 group_by(industry) %>%
 mutate(cum_sum = cumsum(!duplicated(id))) %>%
 group_by(year, industry) %>%
 summarise(cum_sum = max(cum_sum))

   year industry   cum_sum
  <int> <fct>        <int>
1  1999 Government       3
2  1999 NGO              1
3  1999 Private          1
4  2000 Government       4
5  2001 Government       5
6  2001 Private          3

Upvotes: 5

Related Questions