Reputation: 141
I'm a several year lurker, but I've finally found something I couldn't figure out with just old posts. I have a data frame consisting of hundreds of countries, years, and an event variable with a binary indicator:
library('dplyr')
library('data.table')
country<-c("albania","albania","albania","albania","albania","albania","albania","albania","thailand","thailand","thailand","thailand","thailand","thailand","thailand","thailand")
year<-c(1960,1961,1962,1963,1964,1965,1966,1967,1972,1973,1974,1975,1976,1977,1978,1979)
event<-c(0,1,1,0,0,1,1,1,1,1,0,0,1,0,0,0)
input<-data.frame(country=country, year=year, event=event)
input
country year event
1 albania 1960 0
2 albania 1961 1
3 albania 1962 1
4 albania 1963 0
5 albania 1964 0
6 albania 1965 1
7 albania 1966 1
8 albania 1967 1
9 thailand 1972 1
10 thailand 1973 1
11 thailand 1974 0
12 thailand 1975 0
13 thailand 1976 1
14 thailand 1977 0
15 thailand 1978 0
16 thailand 1979 0
I would like to create a new data frame that displays multiple consecutive events for each country with their duration and starting year. For example:
output
country start duration
1 albania 1961 2
2 albania 1965 3
3 thailand 1972 2
4 thailand 1976 1
I've read over, what I believe to be, most of the relevant posts about counting consecutive events by group with dplyr
and data.table
using rle()
and rleid()
, but I can't get them to where I want to be.
Following this example, I can't get a new data frame with more than one event length by country; not just the max, min, etc. and that ignores my need to grab the starting year of the event. Trying to build off this code to get to my desired state left me with lots of errors. The "base code" for the dplyr
examples seems to be some starting point of:
output <- input %>%
group_by(country) %>%
do({
tmp <- with(rle(.$event == 1), lengths[values])
data.frame(country= .$country, Max = if (length(tmp) == 0) 0 else max(tmp))
}) %>%
slice(1L)
That obviously pulls the max, I struggled trying to alter it to pull each event.
Following the data.table
/ rleid
models creates a newly mutated variable counting the duration of consecutive "events" but I was having trouble extracting the "end" years for multiple events within a country. Maybe some lag differencing function using the mutated variable and then extracting all rows with a negative value? Once the row for the end event is flagged the starting year will just be the current year - length. The base code for this approach is:
sum0 <- function(x) { x[x == 1] = sequence(with(rle(x), lengths[values == 1])); x }
setDT(input)[, duration := sum0(event), by = country]
input
country year event duration
1: albania 1960 0 0
2: albania 1961 1 1
3: albania 1962 1 2
4: albania 1963 0 0
5: albania 1964 0 0
6: albania 1965 1 1
7: albania 1966 1 2
8: albania 1967 1 3
9: thailand 1972 1 1
10: thailand 1973 1 2
11: thailand 1974 0 0
12: thailand 1975 0 0
13: thailand 1976 1 1
14: thailand 1977 0 0
15: thailand 1978 0 0
16: thailand 1979 0 0
There are another 7-10 posts I looked through but didn't link as they were similar in nature to the two I referenced. I want to thank anyone in advance who has any advice. I hope I followed all protocol for asking a question; I tried to be careful and follow the rules. Thanks for all the great work you all do! You've gotten me through 5-6 years of learning R and JAGS.
Upvotes: 1
Views: 404
Reputation: 17289
Is this what you want:
library(data.table)
setDT(input)
input[, .(event = event[1], start = year[1], duration = .N),
by = .(country, rleidv(event))][event == 1][
, c('event', 'rleidv') := NULL][]
# country start duration
# 1: albania 1961 2
# 2: albania 1965 3
# 3: thailand 1972 2
# 4: thailand 1976 1
As noted by Frank in comment, this solution is optimized by data.table
in computation, which makes it more efficient. The if(cond) ...
in j
expression won't be optimized.
Upvotes: 2
Reputation: 66819
Here's what I would do (leaving dplyr out of it):
setDT(input)
input[,
if (first(event) == 1) .(year = first(year), N = .N)
, by=.(country, g = rleid(country, event))][, !"g"]
country year N
1: albania 1961 2
2: albania 1965 3
3: thailand 1972 2
4: thailand 1976 1
Not very efficient, but hopefully easy enough to follow.
Upvotes: 3