Bernd Weiss

Reputation: 947

Find overlapping intervals

Context: As far as I can see, R lacks consistent functions that facilitate data preparation for survival/event history analysis, e.g. episode-splitting to include time-varying covariates (sometimes referred to as 'counting process data').

For each individual (id), the start (start.cp) and end time (stop.cp) of each episode are given. Furthermore, for each of the 1, 2, ..., p time-varying covariates (TVC), we know when its interval starts (tvc.start_) and when it ends (tvc.stop_).

In my example (see below) the number of TVCs is 2 but usually the number can vary (from 1 to p).

Example:

Input data:

  id start.cp stop.cp tvc.start1 tvc.start2 tvc.stop1 tvc.stop2
1  1        1       2          2          3         4         7
2  1        2       3          2          3         4         7
3  1        3       4          2          3         4         7
4  1        4       7          2          3         4         7
5  1        7      12          2          3         4         7

structure(list(id = c(1, 1, 1, 1, 1), start.cp = c(1, 2, 3, 4, 
7), stop.cp = c(2, 3, 4, 7, 12), tvc.start1 = c(2, 2, 2, 2, 2
), tvc.start2 = c(3, 3, 3, 3, 3), tvc.stop1 = c(4, 4, 4, 4, 4
), tvc.stop2 = c(7, 7, 7, 7, 7)), .Names = c("id", "start.cp", 
"stop.cp", "tvc.start1", "tvc.start2", "tvc.stop1", "tvc.stop2"), 
row.names = c(NA, 5L), class = "data.frame")

The names of the TVCs are known, i.e. in this example it is known that

tvc.start <- c("tvc.start1", "tvc.start2") 
tvc.stop <- c("tvc.stop1", "tvc.stop2")

Expected results:

  id start.cp stop.cp tvc.start1 tvc.start2 tvc.stop1 tvc.stop2 tvc.d1 tvc.d2
1  1        1       2          2          3         4         7      0      0
2  1        2       3          2          3         4         7      1      0
3  1        3       4          2          3         4         7      1      0
4  1        4       7          2          3         4         7      0      1
5  1        7      12          2          3         4         7      0      1

structure(list(id = c(1, 1, 1, 1, 1), start.cp = c(1, 2, 3, 4, 
7), stop.cp = c(2, 3, 4, 7, 12), tvc.start1 = c(2, 2, 2, 2, 2
), tvc.start2 = c(3, 3, 3, 3, 3), tvc.stop1 = c(4, 4, 4, 4, 4
), tvc.stop2 = c(7, 7, 7, 7, 7), tvc.d1 = c(0, 1, 1, 0, 0), tvc.d2 = c(0, 
0, 0, 1, 1)), .Names = c("id", "start.cp", "stop.cp", "tvc.start1", 
"tvc.start2", "tvc.stop1", "tvc.stop2", "tvc.d1", "tvc.d2"), row.names = c(NA, 
5L), class = "data.frame")

Question: For each TVC, I would like to create a new vector (tvc.d1, tvc.d2, see example) which indicates whether a given episode (defined by start.cp and stop.cp) overlaps (=1) the interval of a TVC. Episodes are assumed to be half-open intervals, [start.cp, stop.cp). How can this be done without looping over the set of TVCs? I am looking for a vectorized solution.

P.S.: Please feel free to change the title...

Upvotes: 1

Views: 1725

Answers (1)

IRTFM

Reputation: 263471

I think Terry Therneau might want to dispute your claim. The tcut and pyears functions in the recommended survival package are described early in his technical article with Cindy Crowson on handling time-dependent covariates. I had trouble understanding why tvc.d1 should be contributing exposure during the interval 2 -> 3 when its stop time was 2, but the explanation for later readers is in the comments to the question.

You really only need the start.cp and stop.cp vectors and the first row as input data. You compare the vector of episode boundaries to each component/individual's start and stop times and keep the boundaries for which findInterval returns 1. I'm wondering whether the data really come in this form; you might not need to duplicate the start and stop times in your setup.
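To make the comparison concrete, here is a small sketch of what findInterval returns for TVC 1. The boundaries and the interval [2, 4) are hard-coded from the example rather than read from dat. Boundaries before the interval get index 0, boundaries inside it get 1, and boundaries at or after its end get 2.

```r
# Episode boundaries from the example: first start, then all stop times.
tvec <- c(1, 2, 3, 4, 7, 12)

# For each boundary, findInterval() counts how many of the break points
# c(2, 4) are less than or equal to it.
idx <- findInterval(tvec, c(2, 4))   # 0 1 1 2 2 2

# Keep one entry per episode start and flag those that fall in [2, 4).
tvc.d1 <- 1 * (idx[1:5] == 1)        # 0 1 1 0 0
```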

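# Assumes the question's input data frame is stored in `dat`.
# Episode boundaries: the first start time followed by every stop time.
tvec <- with(dat, c(start.cp[1], stop.cp))
n <- nrow(dat)                            # number of episodes
dat$tvc.d1 <- 1*( findInterval(tvec,      # the "1*" converts to numeric
                               as.numeric( dat[ 1, c("tvc.start1", "tvc.stop1")]) ,  
                               all.inside=FALSE)[seq_len(n)] == 1)
dat$tvc.d2 <- 1*( findInterval(tvec, 
                               as.numeric( dat[ 1, c("tvc.start2", "tvc.stop2")]) ,  
                               all.inside=FALSE)[seq_len(n)] == 1)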
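Since the question asks for something that scales to p TVCs, here is a minimal sketch of one way to avoid writing a line per covariate. It is not the findInterval approach. It compares matrices directly and assumes both the episodes and the TVC intervals are half-open, [start, stop). The tvc.d naming scheme and the d.names helper are my own choices for illustration. Under that convention tvc.d2 comes out as 0, 0, 1, 1, 0, which agrees with the findInterval code above but not with the question's expected table (0, 0, 0, 1, 1), so check which end-point convention you actually need.

```r
# Input data copied from the question (scalars are recycled over 5 rows).
dat <- data.frame(
  id = rep(1, 5),
  start.cp = c(1, 2, 3, 4, 7),
  stop.cp  = c(2, 3, 4, 7, 12),
  tvc.start1 = 2, tvc.start2 = 3,
  tvc.stop1  = 4, tvc.stop2  = 7
)
tvc.start <- c("tvc.start1", "tvc.start2")
tvc.stop  <- c("tvc.stop1", "tvc.stop2")

# n x p matrices of TVC start and stop times, one column per TVC.
starts <- as.matrix(dat[tvc.start])
stops  <- as.matrix(dat[tvc.stop])

# Two half-open intervals [a, b) and [c, d) overlap iff a < d and c < b.
# start.cp and stop.cp are recycled down each column of the matrices.
overlap <- dat$start.cp < stops & dat$stop.cp > starts

# Hypothetical naming scheme for the new indicator columns.
d.names <- paste0("tvc.d", seq_along(tvc.start))
dat[d.names] <- as.data.frame(1 * overlap)
```

The comparison relies on R's column-major recycling, so it only works as written when the episode vectors have exactly one entry per row of the matrices.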

Upvotes: 1
