Reputation: 174
I have a dataset of consecutive events, grouped by an id and sorted by event start date (start). Each event can have 2 states (cond), a and b. I need to calculate the interval between 2 CONSECUTIVE events (start -end), but only if the newer event meets a certain criterion: cond !="b". If not, interval=NA for that event and move to the next one. Here's a toy dataset:
dt=data.table(stringsAsFactors=FALSE,
id = c(111, 111, 111, 111, 222, 222, 222, 222, 333, 333, 333, 333),
cond=c("a", "a", "a", "a", "a", "b", "a", "a", "a", "b", "b", "b"),
start = c(as.Date( "11/19/2020", format="%m/%d/%Y"),as.Date( "11/27/2020", format="%m/%d/%Y"),as.Date( "12/5/2020", format="%m/%d/%Y"),as.Date( "12/13/2020", format="%m/%d/%Y"),as.Date( "12/21/2020", format="%m/%d/%Y"),as.Date( "12/29/2020", format="%m/%d/%Y"),as.Date( "1/6/2021", format="%m/%d/%Y"),as.Date( "1/14/2021", format="%m/%d/%Y"),as.Date( "1/22/2021", format="%m/%d/%Y"),as.Date( "1/30/2021", format="%m/%d/%Y"),as.Date( "2/7/2021", format="%m/%d/%Y"),as.Date( "2/15/2021", format="%m/%d/%Y")),
end = c(as.Date( "11/24/2020", format="%m/%d/%Y"),as.Date( "12/2/2020", format="%m/%d/%Y"),as.Date( "12/10/2020", format="%m/%d/%Y"),as.Date( "12/18/2020", format="%m/%d/%Y"),as.Date( "12/26/2020", format="%m/%d/%Y"),as.Date( "1/3/2021", format="%m/%d/%Y"),as.Date( "1/11/2021", format="%m/%d/%Y"),as.Date( "1/19/2021", format="%m/%d/%Y"),as.Date( "1/27/2021", format="%m/%d/%Y"),as.Date( "2/4/2021", format="%m/%d/%Y"),as.Date( "2/12/2021", format="%m/%d/%Y"),as.Date( "2/20/2021", format="%m/%d/%Y")
))
The result should look like this:
id cond start end interval
1 111 a 11/19/2020 11/24/2020 3
2 111 a 11/27/2020 12/2/2020 3
3 111 a 12/5/2020 12/10/2020 3
4 111 a 12/13/2020 12/18/2020 NA
5 222 a 12/21/2020 12/26/2020 NA
6 222 b 12/29/2020 1/3/2021 3
7 222 a 1/6/2021 1/11/2021 3
8 222 a 1/14/2021 1/19/2021 NA
9 333 a 1/22/2021 1/27/2021 NA
10 333 b 1/30/2021 2/4/2021 NA
11 333 b 2/7/2021 2/12/2021 NA
12 333 b 2/15/2021 2/20/2021 NA
So, the condition cond!="b" applies only to the "start" row; i would have hoped that something like this would work:
dt[,interval:= shift(start[cond!="b"],n = 1L,type="lead") -end ,id]
but it doesn't. I've also tried a rolling join:
dt[,row_id:=rowid(id)]
dt[, interval :=
.SD[.(cond!="b", id = id, row_id = row_id +1), on=.(id,row_id), roll=-Inf, x.start]-end]
which doesn't do what i need either. I feel that data.table should be able to handle this task pretty easily. i just can't figure out how. Any help will be highly appreciated.
Upvotes: 2
Views: 53
Reputation: 388817
To get interval
value you can subtract the next start
date from end
date for each id
. Turn all the interval
values to NA
where the next value in cond
is 'b'
.
library(data.table)
dt[, interval := as.integer(shift(start, type = 'lead') - end), id]
dt[shift(cond, type = 'lead') == 'b', interval := NA]
dt
# id cond start end interval
# 1: 111 a 2020-11-19 2020-11-24 3
# 2: 111 a 2020-11-27 2020-12-02 3
# 3: 111 a 2020-12-05 2020-12-10 3
# 4: 111 a 2020-12-13 2020-12-18 NA
# 5: 222 a 2020-12-21 2020-12-26 NA
# 6: 222 b 2020-12-29 2021-01-03 3
# 7: 222 a 2021-01-06 2021-01-11 3
# 8: 222 a 2021-01-14 2021-01-19 NA
# 9: 333 a 2021-01-22 2021-01-27 NA
#10: 333 b 2021-01-30 2021-02-04 NA
#11: 333 b 2021-02-07 2021-02-12 NA
#12: 333 b 2021-02-15 2021-02-20 NA
Upvotes: 1