Jamalan
Jamalan

Reputation: 580

Counting number of partial overlaps for date range intervals

I need to run a loop over a dataframe with ~150K rows. However, the loop needs to check each row and check a condition that checks every other row in the data set. My code works fine for a toy dataset it produces the correct value but is far far too slow for my actual dataset. I let it run for several hours and it still never finished. so I'm hoping someone has a better idea how to approach this.

#R version 3.5.1 Windows 64-bit

#Example dataset
my_df <- data.frame("PERSON" = c("A","A","A","B","A","A","B"),
                    "DATE_START" = c("2019-01-15","2019-01-10","2019-01-20","2019-01-19","2018-12-20","2018-03-03","2019-05-01"),
                    "DATE_FINISH" = c("2019-01-30","2019-01-18","2019-02-05","2019-01-23","2019-02-10","2018-04-01","2019-06-06")
                    )
#Each row is a task that the assigned person is working on
my_df
   PERSON   DATE_START DATE_FINISH
1       A   2019-01-15  2019-01-30
2       A   2019-01-10  2019-01-18
3       A   2019-01-20  2019-02-05
4       B   2019-01-19  2019-01-23
5       A   2018-12-20  2019-02-10
6       A   2018-03-03  2018-04-01
7       B   2019-05-01  2019-06-06

What I want to know is FOR row 1, how many other tasks does Person A have overlap between his start and finish dates? (including the row its on)

So the answer I'm looking for is

   PERSON   DATE_START DATE_FINISH  NUMBER_OF_TASKS
1       A   2019-01-15  2019-01-30  4
2       A   2019-01-10  2019-01-18  3
3       A   2019-01-20  2019-02-05  3
4       B   2019-01-19  2019-01-23  1
5       A   2018-12-20  2019-02-10  4
6       A   2018-03-03  2018-04-01  1
7       B   2019-05-01  2019-06-06  1

So this basically says for Row 1, person A had 4 open tasks

I tried created list element for every row that includes the range of dates as numeric values and then to check if there is overlap I used %in% operator to compare the unlisted ranges

I've done something similar using the lapply function(not shown here) but same issue it just takes forever to execute.

##This is what I currently have

temp_list <- list()
num_open_tasks <- c()
open_work_cc <- c()

##Create a list of length = nrow(my_df)
##Each element in the list is a range of dates coerced to numeric
for(i in 1:nrow(my_df)){
  temp_list[[i]] <- as.numeric(my_df$DATE_START[i]) : 
                    as.numeric(my_df$DATE_FINISH[i])
}


for(i in 1:nrow(my_df)){
  for(j in 1:nrow(my_df)){

##If elements from the temp_list overlap by 5 days, the overlap = 5
##I'm just checking if the overlap is greater than 0 (is there any overlap at all)
##And if the tasks belongs to the same person or not    
open_work_cc[j] <- ifelse(sum(unlist(temp_list[[i]]) %in% 
                              unlist(temp_list[[j]])) > 0 &
                              my_df$PERSON[i] == my_df$PERSON[j] 
                              ,1,0
                           )
open_work_cc_total <- sum(open_work_cc)

  }
  num_open_tasks[i] <- open_work_cc_total

}
my_df <- cbind(my_df, num_open_tasks)

This method returns my desired column populated with the correct values. But I imagine there is a more elegant and significantly faster method using some form of split/apply/combine. Any and all help is appreciated thanks

Upvotes: 2

Views: 703

Answers (1)

Michael
Michael

Reputation: 5898

foverlaps in data.table is probably going to be the fastest approach in R. I think the following code does what you want:

library(data.table)
setDT(my_df)

my_df[, DATE_START_N:=as.numeric(as.Date(DATE_START))]
my_df[, DATE_FINISH_N:=as.numeric(as.Date(DATE_FINISH))]

setkey(my_df, PERSON, DATE_START_N,DATE_FINISH_N)

my_df[,NUMBER_OF_TASKS:=foverlaps(my_df,my_df,which=TRUE)[,.N,by=xid]$N]
my_df

For a bit more clarity: foverlaps(my_df,my_df,which=TRUE) does a self-join on date-range within PERSON (the join is determined by setkey). Note that the default argument for type of interval join for foverlaps is "any" which is a partial match on intervals: i.e., what you want here.

Specifying which=TRUE will just provide the indices of matches in x and y (rather than the actual joined data which is unnecessary here). The call to foverlaps returns an object of class data.table which is then immediately aggregated to take the number of rows in each group defined by xid (which are just rows of my_df) using the bracket function call [,.N,by=xid]. These counts are extracted into a vector with $N and assigned to a new column NUMBER_OF_TASKS in my_df.

Upvotes: 5

Related Questions