Let's say I have a data.frame like so:
user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)
For each user I have a unique id, an industry, and a year.
I'm trying to compute a cumulative count of the people who have ever worked in Government, so the cumulative count for a given year should be the total number of unique users seen in Government in that year or any preceding year.
I know I can do an ordinary cumulative sum like so:
library(dplyr)
user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 2
6 2001 Private 3
However, this isn't what I want, since the sums in the years 2000 and 2001 will include people who have already been counted in 1999. I want each year to be a cumulative count of the total number of unique users who have ever worked in Government as of that year. I couldn't figure out the right way to do this in dplyr.
So the correct output should look like:
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
Upvotes: 2
Views: 105
1) sqldf This can be implemented with a self-join in SQL: each row is joined to the rows having the same industry and the same or an earlier year, and the result is then grouped by year and industry, counting the distinct ids.
library(sqldf)
sqldf("select a.year, a.industry, count(distinct b.id) cum_sum
from user_df a
left join user_df b on b.industry = a.industry and b.year <= a.year
group by a.year, a.industry")
giving:
year industry cum_sum
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
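As an aside, the same non-equi self-join can be written directly in dplyr (1.1.0 or later) using join_by(); a rough, untested sketch along these lines should give the same result, though depending on the dplyr version it may warn about many-to-many matches:
library(dplyr)  # join_by() with inequality conditions needs dplyr >= 1.1.0

user_df %>%
  left_join(user_df, by = join_by(industry, x$year >= y$year)) %>%
  group_by(year = year.x, industry) %>%
  summarise(cum_sum = n_distinct(id.y), .groups = "drop")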
2) base A base-only solution can be formed by merging the data frame to itself on industry, subsetting to rows with the same or an earlier year, and aggregating over industry and year. This is inefficient because, unlike the SQL statement, which filters as it joins, it creates the entire join before filtering it down; however, if your data is not too large this may be sufficient. (A year-by-year sketch that avoids the full join follows the output below.)
m <- merge(user_df, user_df, by = "industry")
s <- subset(m, year.y <= year.x)
ag <- aggregate(id.y ~ industry + year.x, s, function(x) length(unique(x)))
names(ag) <- sub("\\..*", "", names(ag))
ag
giving:
industry year id
1 Government 1999 3
2 NGO 1999 1
3 Private 1999 1
4 Government 2000 4
5 Government 2001 5
6 Private 2001 3
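If materializing the full self-join is a concern, a rough base-only alternative (just a sketch, assuming the same user_df as above; yrs and ag2 are illustrative names) is to loop over the observed years and aggregate each cumulative slice directly:
yrs <- sort(unique(user_df$year))
ag2 <- do.call(rbind, lapply(yrs, function(y) {
  cur <- subset(user_df, year <= y)   # all rows up to and including year y
  out <- aggregate(id ~ industry, cur, function(x) length(unique(x)))
  # keep only industries actually observed in year y, to match the output above
  out <- subset(out, industry %in% user_df$industry[user_df$year == y])
  cbind(year = y, out)
}))
ag2
This only holds one cumulative slice in memory at a time, at the cost of re-scanning the earlier years for each y.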
Upvotes: 2
One option might be:
user_df %>%
  group_by(industry) %>%
  mutate(cum_sum = cumsum(!duplicated(id))) %>%
  group_by(year, industry) %>%
  summarise(cum_sum = max(cum_sum))
year industry cum_sum
<int> <fct> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
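Note that cumsum(!duplicated(id)) counts first appearances in row order, so this relies on the rows already being sorted by year within each industry (which is the case in the example data). If that ordering isn't guaranteed, a defensive variant (just a sketch) could add an explicit arrange() first:
library(dplyr)

user_df %>%
  arrange(year) %>%                                # ensure earlier years come first
  group_by(industry) %>%
  mutate(cum_sum = cumsum(!duplicated(id))) %>%    # running count of ids not seen before in this industry
  group_by(year, industry) %>%
  summarise(cum_sum = max(cum_sum), .groups = "drop")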
Upvotes: 5