Josh Brinks
Josh Brinks

Reputation: 141

Count Multiple Consecutive Events by Group w/ Starting Year

I'm a several year lurker, but I've finally found something I couldn't figure out with just old posts. I have a data frame consisting of hundreds of countries, years, and an event variable with a binary indicator:

library('dplyr')
library('data.table')

country<-c("albania","albania","albania","albania","albania","albania","albania","albania","thailand","thailand","thailand","thailand","thailand","thailand","thailand","thailand")
year<-c(1960,1961,1962,1963,1964,1965,1966,1967,1972,1973,1974,1975,1976,1977,1978,1979)
event<-c(0,1,1,0,0,1,1,1,1,1,0,0,1,0,0,0)
input<-data.frame(country=country, year=year, event=event)

input    

    country year event
1   albania 1960     0
2   albania 1961     1
3   albania 1962     1
4   albania 1963     0
5   albania 1964     0
6   albania 1965     1
7   albania 1966     1
8   albania 1967     1
9  thailand 1972     1
10 thailand 1973     1
11 thailand 1974     0
12 thailand 1975     0
13 thailand 1976     1
14 thailand 1977     0
15 thailand 1978     0
16 thailand 1979     0

I would like to create a new data frame that displays multiple consecutive events for each country with their duration and starting year. For example:

output

   country start duration
1  albania 1961        2
2  albania 1965        3
3 thailand 1972        2
4 thailand 1976        1

I've read over, what I believe to be, most of the relevant posts about counting consecutive events by group with dplyr and data.table using rle() and rleid(), but I can't get them to where I want to be.

Following this example, I can't get a new data frame with more than one event length by country; not just the max, min, etc. and that ignores my need to grab the starting year of the event. Trying to build off this code to get to my desired state left me with lots of errors. The "base code" for the dplyr examples seems to be some starting point of:

output <- input %>%
group_by(country) %>%
do({
tmp <- with(rle(.$event == 1), lengths[values])
data.frame(country= .$country, Max = if (length(tmp) == 0) 0 else max(tmp))
 }) %>%
 slice(1L)

That obviously pulls the max, I struggled trying to alter it to pull each event.

Following the data.table / rleid models creates a newly mutated variable counting the duration of consecutive "events" but I was having trouble extracting the "end" years for multiple events within a country. Maybe some lag differencing function using the mutated variable and then extracting all rows with a negative value? Once the row for the end event is flagged the starting year will just be the current year - length. The base code for this approach is:

sum0 <- function(x) { x[x == 1] = sequence(with(rle(x), lengths[values == 1])); x }
setDT(input)[, duration := sum0(event), by = country]

input

     country year event duration
 1:  albania 1960     0        0
 2:  albania 1961     1        1
 3:  albania 1962     1        2 
 4:  albania 1963     0        0
 5:  albania 1964     0        0
 6:  albania 1965     1        1
 7:  albania 1966     1        2
 8:  albania 1967     1        3
 9: thailand 1972     1        1
10: thailand 1973     1        2
11: thailand 1974     0        0
12: thailand 1975     0        0
13: thailand 1976     1        1
14: thailand 1977     0        0
15: thailand 1978     0        0
16: thailand 1979     0        0

There are another 7-10 posts I looked through but didn't link as they were similar in nature to the two I referenced. I want to thank anyone in advance who has any advice. I hope I followed all protocol for asking a question; I tried to be careful and follow the rules. Thanks for all the great work you all do! You've gotten me through 5-6 years of learning R and JAGS.

Upvotes: 1

Views: 404

Answers (2)

mt1022
mt1022

Reputation: 17289

Is this what you want:

library(data.table)

setDT(input)
input[, .(event = event[1], start = year[1], duration = .N),
      by = .(country, rleidv(event))][event == 1][
          , c('event', 'rleidv') := NULL][]

#     country start duration
# 1:  albania  1961        2
# 2:  albania  1965        3
# 3: thailand  1972        2
# 4: thailand  1976        1

As noted by Frank in comment, this solution is optimized by data.table in computation, which makes it more efficient. The if(cond) ... in j expression won't be optimized.

Upvotes: 2

Frank
Frank

Reputation: 66819

Here's what I would do (leaving dplyr out of it):

setDT(input)

input[, 
  if (first(event) == 1) .(year = first(year), N = .N)
, by=.(country, g = rleid(country, event))][, !"g"]

    country year N
1:  albania 1961 2
2:  albania 1965 3
3: thailand 1972 2
4: thailand 1976 1

Not very efficient, but hopefully easy enough to follow.

Upvotes: 3

Related Questions