user6899412
user6899412

Reputation: 41

Identifying group specific timed events within a defined period

I have a dataset that records categorised events which have two time points t1 and t2, where t2 is aways >= t1. Events are categorised z1, z2, z3 or NA (uncategorised). For each group (grp) i want to identify events where t1 is within a specified period of any categorised t2 event in the group. In my example i've referred to these t2 events as reference dates. The loop procedure shows what i require but is hopelessly inefficient over a large dataset. I require this to run on a dataset comprising several million rows with more than one million groups.

I have also shown my attempts to code this more efficiently using data.table syntax. My method is to assign the set of reference dates for each group to the vector "ref", then for each row in the group calculate difference between t1 and the reference dates and test these interval(s) against a specified interval, this then returns a single boolean indicating whether the row specific t1 is within 30 days of any of the reference dates. When i restrict each group to a single reference date (by taking the first date [1]) the code works, but when i allow multiple reference dates, as required, the code returns errors. Clearly I am not understanding what data table is doing within the j statement. Could someone explain what i've got wrong and suggest an efficient data.table solution.

example data

library("data.table")
DT <-read.table(text=
"grp,zcat,t1,t2
a,NA,2007-03-18,2007-03-28
a,z1,2007-08-04,2007-08-14
a,NA,2007-08-21,2007-08-23
a,NA,2007-11-21,2007-11-29
a,z1,2007-12-10,2007-12-13
a,z2,2008-02-16,2008-02-19
a,NA,2008-03-14,2008-03-21
a,NA,2008-05-27,2008-06-03
b,NA,2003-04-22,2003-04-27
b,z3,2003-05-11,2003-05-23
b,z1,2003-07-16,2003-08-02
c,z3,2011-01-18,2011-02-07
c,z3,2011-03-01,2011-03-13
c,NA,2011-03-30,2011-04-11
c,NA,2011-05-21,2011-05-28",
header=TRUE, sep=",", stringsAsFactors=FALSE, na.strings="NA", colClasses="character")
DT <-data.table(DT)
setorder(DT,grp,t1)

reference dates by group

grp-a: "2007-08-14" "2007-12-13" "2008-02-19"

grp-b: "2003-05-23" "2003-08-02"

grp-c: "2011-02-07" "2011-03-13"

loop procedure - ok

out<-c()
for(i in 1:nrow(DT)){
    ref <-DT[grp == grp[i] & !is.na(zcat),t2]
    temp <-as.Date(DT$t1[i]) - as.Date(ref)
    out[i] <-any(temp >=0 & temp <31)
    rm(ref,temp)
    # ref; delta; delta >=0 & delta <31
    if(i==nrow(DT)){DT[, newvar :=out]; rm(out)}
}

data.table coding attempts

the first two examples work ok but use a single reference data for each group, the third example uses the same principle with all reference dates and fails. The problem seems to be in how j statement is handling multiple interval values

DT[,{ref=t2[!is.na(zcat)]; delta=as.Date(t1) - as.Date(ref)[1]; delta >0 & delta <30}, by=grp]

DT[,{ref=t2[!is.na(zcat)][1]; delta=as.Date(t1) - as.Date(ref); delta >0 & delta <30}, by=grp]

DT[,{ref=t2[!is.na(zcat)]; delta=as.Date(t1) - as.Date(ref); any(delta >0 & delta <30)}, by=grp]

Upvotes: 4

Views: 157

Answers (1)

eddi
eddi

Reputation: 49448

Using the latest version (1.9.8+):

DT[, `:=`(t1 = as.Date(t1), t2 = as.Date(t2), newvar = FALSE)]

DT[DT[!is.na(zcat), .(grp, t2, t2.end = t2 + 31)],
   on = .(grp, t1 >= t2, t1 < t2.end),
   newvar := TRUE]

Upvotes: 2

Related Questions