wetcoaster
wetcoaster

Reputation: 367

Exclude intervals that overlap between two data frame's (by range of two column values)

This is almost an extension of a previous question I asked, but I've run into a new problem I haven't found a solution for.

Here is the original question and answer: Find matching intervals in data frame by range of two column values

(this found overlapping intervals that were common among different names within same data frame)

I now want to find a way to exclude row's in DF1 when there are overlapping intervals with a new data-frame, DF2.

Using the same DF1 :

Name     Event Order     Sequence     start_event     end_event     duration     Group 
JOHN     1               A               0               19          19           ID1
JOHN     2               A               60              112         52           ID1  
JOHN     3               A               392             429         37           ID1  
JOHN     4               B               282             329         47           ID1
JOHN     5               C               147             226         79           ID1  
JOHN     6               C               566             611         45           ID1  
ADAM     1               A               19              75          56           ID1
ADAM     2               A               384             407         23           ID1  
ADAM     3               B               0               79          79           ID1  
ADAM     4               B               505             586         81           ID1
ADAM     5               C               140             205         65           ID1  
ADAM     6               C               522             599         77           ID1  

This continues for 18 different names and two ID groups.

Now have a second data frame with intervals that I wish to exclude from the above data frame.

Here is an example of DF2:

Name     Event Order     Sequence     start_event     end_event     duration     Group 
GAP1     1               A               55               121         66           ID1
GAP2     2               A               394              419         25           ID1  
GAP3     3               C               502              635         133          ID1  

I.E., I am hoping to find any interval for each "Name" in DF1, that is in the same "Sequence" and has overlapping time at any point of the interval found in DF2 (any portion, whether it begins before the start event, or begins midway and ends after the end event). I would like to iterate through each distinct "Name" in DF1. Also, the sequence matters, so I would only like to return results found common between sequence A and sequence A, then sequence B and sequence B, and finally sequence C and sequence C.

Desired Result (showing just the first name):

Name     Event Order     Sequence     start_event     end_event     duration     Group 
JOHN     1               A               0               19          19           ID1
JOHN     4               B               282             329         47           ID1
JOHN     5               C               147             226         79           ID1  
ADAM     3               B               0               79          79           ID1  
ADAM     4               B               505             586         81           ID1
ADAM     5               C               140             205         65           ID1  

Last time the answer was resolved in part with foverlaps, but I am still not overly familiar with it to be able to solve this problem - assuming that's the best way to answer this.

Thanks!

Upvotes: 1

Views: 481

Answers (1)

nbafrank
nbafrank

Reputation: 126

This piece of code should work for you

library(data.table)

Dt1 <- data.table(a = 1:1000,b=1:1000 + 100)
Dt2 <- data.table(a = 100:200,b=100:200+10)

#identify the positions that are not allowed
badSeq <- unique(unlist(lapply(1:nrow(Dt2),function(i) Dt2[i,a:b,])))

#select for the rows outside of the range
correctPos <- sapply(1:nrow(Dt1),
                 function(i)
                   all(!Dt1[i,a:b %in% badSeq]))


Dt1[correctPos,]

I have done it with data.tables rather than data.frames. I like them better and they can be faster. But you can apply the same ideas to a data.frame

Upvotes: 2

Related Questions