maximus
maximus

Reputation: 174

R interval between consecutive events meeting condition - data.table way

I have a dataset of consecutive events, grouped by an id and sorted by event start date (start). Each event can have 2 states (cond), a and b. I need to calculate the interval between 2 CONSECUTIVE events (start -end), but only if the newer event meets a certain criterion: cond !="b". If not, interval=NA for that event and move to the next one. Here's a toy dataset:

dt=data.table(stringsAsFactors=FALSE,
      id = c(111, 111, 111, 111, 222, 222, 222, 222, 333, 333, 333, 333),
    cond=c("a", "a", "a", "a", "a", "b", "a", "a", "a", "b", "b", "b"),
   start = c(as.Date( "11/19/2020", format="%m/%d/%Y"),as.Date( "11/27/2020", format="%m/%d/%Y"),as.Date( "12/5/2020", format="%m/%d/%Y"),as.Date( "12/13/2020", format="%m/%d/%Y"),as.Date( "12/21/2020", format="%m/%d/%Y"),as.Date( "12/29/2020", format="%m/%d/%Y"),as.Date( "1/6/2021", format="%m/%d/%Y"),as.Date( "1/14/2021", format="%m/%d/%Y"),as.Date( "1/22/2021", format="%m/%d/%Y"),as.Date( "1/30/2021", format="%m/%d/%Y"),as.Date( "2/7/2021", format="%m/%d/%Y"),as.Date( "2/15/2021", format="%m/%d/%Y")),
     end = c(as.Date( "11/24/2020", format="%m/%d/%Y"),as.Date( "12/2/2020", format="%m/%d/%Y"),as.Date( "12/10/2020", format="%m/%d/%Y"),as.Date( "12/18/2020", format="%m/%d/%Y"),as.Date( "12/26/2020", format="%m/%d/%Y"),as.Date( "1/3/2021", format="%m/%d/%Y"),as.Date( "1/11/2021", format="%m/%d/%Y"),as.Date( "1/19/2021", format="%m/%d/%Y"),as.Date( "1/27/2021", format="%m/%d/%Y"),as.Date( "2/4/2021", format="%m/%d/%Y"),as.Date( "2/12/2021", format="%m/%d/%Y"),as.Date( "2/20/2021", format="%m/%d/%Y")
))

The result should look like this:

    id cond      start        end interval
1  111    a 11/19/2020 11/24/2020        3
2  111    a 11/27/2020  12/2/2020        3
3  111    a  12/5/2020 12/10/2020        3
4  111    a 12/13/2020 12/18/2020       NA
5  222    a 12/21/2020 12/26/2020       NA
6  222    b 12/29/2020   1/3/2021        3
7  222    a   1/6/2021  1/11/2021        3
8  222    a  1/14/2021  1/19/2021       NA
9  333    a  1/22/2021  1/27/2021       NA
10 333    b  1/30/2021   2/4/2021       NA
11 333    b   2/7/2021  2/12/2021       NA
12 333    b  2/15/2021  2/20/2021       NA

So, the condition cond!="b" applies only to the "start" row; i would have hoped that something like this would work:

dt[,interval:= shift(start[cond!="b"],n = 1L,type="lead") -end ,id]

but it doesn't. I've also tried a rolling join:

dt[,row_id:=rowid(id)]
dt[, interval := 
     .SD[.(cond!="b", id = id, row_id = row_id +1), on=.(id,row_id), roll=-Inf, x.start]-end]

which doesn't do what i need either. I feel that data.table should be able to handle this task pretty easily. i just can't figure out how. Any help will be highly appreciated.

Upvotes: 2

Views: 53

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

To get interval value you can subtract the next start date from end date for each id. Turn all the interval values to NA where the next value in cond is 'b'.

library(data.table)

dt[, interval := as.integer(shift(start, type = 'lead') - end), id]
dt[shift(cond, type = 'lead') == 'b', interval := NA]
dt

#     id cond      start        end interval
# 1: 111    a 2020-11-19 2020-11-24        3
# 2: 111    a 2020-11-27 2020-12-02        3
# 3: 111    a 2020-12-05 2020-12-10        3
# 4: 111    a 2020-12-13 2020-12-18       NA
# 5: 222    a 2020-12-21 2020-12-26       NA
# 6: 222    b 2020-12-29 2021-01-03        3
# 7: 222    a 2021-01-06 2021-01-11        3
# 8: 222    a 2021-01-14 2021-01-19       NA
# 9: 333    a 2021-01-22 2021-01-27       NA
#10: 333    b 2021-01-30 2021-02-04       NA
#11: 333    b 2021-02-07 2021-02-12       NA
#12: 333    b 2021-02-15 2021-02-20       NA

Upvotes: 1

Related Questions