Reputation: 3555
I have a database where I want to calculate the cumulative sum of 2 criteria
dfdata = data.frame(car = c("toyota","toyota","toyota","toyota","toyota",
"honda","honda","honda","honda",
"lada","lada","lada","lada"),
year = c(2000,2000,2001,2001,2002,2001,2001,2002,2002,2003,2004,2005,2006),
id = c("a","b","a","c","a","d","d","d","e","f","f","f","f"))
You can see down the data:
dfdata
car year id
1 toyota 2000 a
2 toyota 2000 b
3 toyota 2001 a
4 toyota 2001 c
5 toyota 2002 a
6 honda 2001 d
7 honda 2001 d
8 honda 2002 d
9 honda 2002 e
10 lada 2003 f
11 lada 2004 f
12 lada 2005 f
13 lada 2006 f
Imagine I was observing cars passing by and that the plate on it is an "ID". So a car with the same id is the exact same car.
Here is how I got point 1. and point 2.
dfdata %>%
group_by(car, year) %>%
dplyr::summarise(nb = n()) %>%
dplyr::mutate(cs = cumsum(nb)) %>%
ungroup()
nb is the number of cars from a certain manufacturer I've seen in a particular year. cs is the cumulative sum of the cars across the years.
# A tibble: 9 x 4
car year nb cs
<fct> <dbl> <int> <int>
1 honda 2001 2 2
2 honda 2002 2 4
3 lada 2003 1 1
4 lada 2004 1 2
5 lada 2005 1 3
6 lada 2006 1 4
7 toyota 2000 2 2
8 toyota 2001 2 4
9 toyota 2002 1 5
But notice that I've lost the ID column. How can I get the number of cars that I've seen multiple times for the same ID.
Final output should be based on grouping ID (to answer point 3):
car year nb cs curetrap curetrap.no.same.year
1 honda 2001 2 2 1 0
2 honda 2002 2 4 2 1
3 lada 2003 1 1 0 0
4 lada 2004 1 2 1 1
5 lada 2005 1 3 2 2
6 lada 2006 1 4 3 3
7 toyota 2000 2 2 0 0
8 toyota 2001 2 4 1 1
9 toyota 2002 1 5 2 2
This is because "honda" have been seen 2 times in 2001 and 2 times in 2002. So the cumulative sum is 2 in 2001 and 2 + 2 in 2002. Then, within the same year I've seen the honda "d" twice, meaning that I "recaptured" the "d" 2001 honda hence the "1" in curetrap for 2001. In 2002, I recaptured the honda "d" again, thus the cumulative sum increased. For "curetrap.no.same.year" it's the same thing, but I want to ignore the recapture of the honda "d" in 2001 since it's the same year.
How is it possible to do that? Since I'm loosing the ID information, do I need to do it in 2 steps?
So far this is what I have:
tab.df = cbind(table(dfdata$id,dfdata$year),
car = as.character(dfdata[match(unique(dfdata$id),table = dfdata$id),"car"]))
df.df = as.data.frame(tab.df)
2000 2001 2002 2003 2004 2005 2006 car
a 1 1 1 0 0 0 0 toyota
b 1 0 0 0 0 0 0 toyota
c 0 1 0 0 0 0 0 toyota
d 0 2 1 0 0 0 0 honda
e 0 0 1 0 0 0 0 honda
f 0 0 0 1 1 1 1 lada
Which shows all the times I've seen a car in a year for a certain ID.
Upvotes: 1
Views: 746
Reputation: 1254
You can factor the problem into 2 steps by first adding binary variables in your original dataset which will flag the records you want to count, and then by simply computing sum and cumsum of these flags.
The following code gives the result you want
dfdata %>%
group_by(car, id) %>%
arrange(year, .by_group=TRUE) %>%
dplyr::mutate(already_seen = row_number()>1, already_seen_diff_year = year>year[1]) %>%
group_by(car, year) %>%
dplyr::summarise(nb = n(), cs = nb, curetrap = sum(already_seen), curetrap.no.same.year = sum(already_seen_diff_year)) %>%
dplyr::mutate_at(vars(cs, curetrap, curetrap.no.same.year), cumsum) %>%
ungroup()
NB: duplicating variable cs = nb
is just a trick to write easily the subsequent call to mutate_at
Upvotes: 1