Cumulative sum for 2 criteria in R

Question

I have a database where I want to calculate the cumulative sum of 2 criteria

dfdata = data.frame(car = c("toyota","toyota","toyota","toyota","toyota",
                            "honda","honda","honda","honda",
                            "lada","lada","lada","lada"),
                    year = c(2000,2000,2001,2001,2002,2001,2001,2002,2002,2003,2004,2005,2006),
                    id = c("a","b","a","c","a","d","d","d","e","f","f","f","f"))

You can see down the data:

dfdata
      car year id
1  toyota 2000  a
2  toyota 2000  b
3  toyota 2001  a
4  toyota 2001  c
5  toyota 2002  a
6   honda 2001  d
7   honda 2001  d
8   honda 2002  d
9   honda 2002  e
10   lada 2003  f
11   lada 2004  f
12   lada 2005  f
13   lada 2006  f

Imagine I was observing cars passing by and that the plate on it is an "ID". So a car with the same id is the exact same car.

I want the sum of cars companies I've seen in one year
I want the cumulative sum of cars companies I've seen across the years
I want the cumulative sum of the cars companies I've seen more than once (counting the ones I've seen in the same year and the other years AND another column counting the ones that I've seen ONLY in the other years)

Here is how I got point 1. and point 2.

dfdata %>%  
  group_by(car, year) %>% 
  dplyr::summarise(nb = n())  %>% 
  dplyr::mutate(cs = cumsum(nb)) %>% 
  ungroup()

nb is the number of cars from a certain manufacturer I've seen in a particular year. cs is the cumulative sum of the cars across the years.

# A tibble: 9 x 4
  car     year    nb    cs
      
1 honda   2001     2     2
2 honda   2002     2     4
3 lada    2003     1     1
4 lada    2004     1     2
5 lada    2005     1     3
6 lada    2006     1     4
7 toyota  2000     2     2
8 toyota  2001     2     4
9 toyota  2002     1     5

But notice that I've lost the ID column. How can I get the number of cars that I've seen multiple times for the same ID.

Final output should be based on grouping ID (to answer point 3):

     car year nb cs curetrap curetrap.no.same.year
1  honda 2001  2  2        1                     0
2  honda 2002  2  4        2                     1
3   lada 2003  1  1        0                     0
4   lada 2004  1  2        1                     1
5   lada 2005  1  3        2                     2
6   lada 2006  1  4        3                     3
7 toyota 2000  2  2        0                     0
8 toyota 2001  2  4        1                     1
9 toyota 2002  1  5        2                     2

This is because "honda" have been seen 2 times in 2001 and 2 times in 2002. So the cumulative sum is 2 in 2001 and 2 + 2 in 2002. Then, within the same year I've seen the honda "d" twice, meaning that I "recaptured" the "d" 2001 honda hence the "1" in curetrap for 2001. In 2002, I recaptured the honda "d" again, thus the cumulative sum increased. For "curetrap.no.same.year" it's the same thing, but I want to ignore the recapture of the honda "d" in 2001 since it's the same year.

How is it possible to do that? Since I'm loosing the ID information, do I need to do it in 2 steps?

So far this is what I have:

tab.df = cbind(table(dfdata$id,dfdata$year),
      car = as.character(dfdata[match(unique(dfdata$id),table = dfdata$id),"car"]))
df.df = as.data.frame(tab.df)

  2000 2001 2002 2003 2004 2005 2006    car
a    1    1    1    0    0    0    0 toyota
b    1    0    0    0    0    0    0 toyota
c    0    1    0    0    0    0    0 toyota
d    0    2    1    0    0    0    0  honda
e    0    0    1    0    0    0    0  honda
f    0    0    0    1    1    1    1   lada

Which shows all the times I've seen a car in a year for a certain ID.

Pierre Gramme · Accepted Answer

You can factor the problem into 2 steps by first adding binary variables in your original dataset which will flag the records you want to count, and then by simply computing sum and cumsum of these flags.

The following code gives the result you want

dfdata %>% 
  group_by(car, id) %>% 
  arrange(year, .by_group=TRUE) %>% 
  dplyr::mutate(already_seen = row_number()>1, already_seen_diff_year = year>year[1])  %>% 
  group_by(car, year) %>% 
  dplyr::summarise(nb = n(), cs = nb, curetrap = sum(already_seen), curetrap.no.same.year = sum(already_seen_diff_year))  %>% 
  dplyr::mutate_at(vars(cs, curetrap, curetrap.no.same.year), cumsum) %>% 
  ungroup()

NB: duplicating variable cs = nb is just a trick to write easily the subsequent call to mutate_at

Cumulative sum for 2 criteria in R

Answers (1)

Related Questions