Reputation: 580
I need to run a loop over a dataframe with ~150K rows. However, the loop needs to check each row and check a condition that checks every other row in the data set. My code works fine for a toy dataset it produces the correct value but is far far too slow for my actual dataset. I let it run for several hours and it still never finished. so I'm hoping someone has a better idea how to approach this.
#R version 3.5.1 Windows 64-bit
#Example dataset
my_df <- data.frame("PERSON" = c("A","A","A","B","A","A","B"),
"DATE_START" = c("2019-01-15","2019-01-10","2019-01-20","2019-01-19","2018-12-20","2018-03-03","2019-05-01"),
"DATE_FINISH" = c("2019-01-30","2019-01-18","2019-02-05","2019-01-23","2019-02-10","2018-04-01","2019-06-06")
)
#Each row is a task that the assigned person is working on
my_df
PERSON DATE_START DATE_FINISH
1 A 2019-01-15 2019-01-30
2 A 2019-01-10 2019-01-18
3 A 2019-01-20 2019-02-05
4 B 2019-01-19 2019-01-23
5 A 2018-12-20 2019-02-10
6 A 2018-03-03 2018-04-01
7 B 2019-05-01 2019-06-06
What I want to know is FOR row 1, how many other tasks does Person A have overlap between his start and finish dates? (including the row its on)
So the answer I'm looking for is
PERSON DATE_START DATE_FINISH NUMBER_OF_TASKS
1 A 2019-01-15 2019-01-30 4
2 A 2019-01-10 2019-01-18 3
3 A 2019-01-20 2019-02-05 3
4 B 2019-01-19 2019-01-23 1
5 A 2018-12-20 2019-02-10 4
6 A 2018-03-03 2018-04-01 1
7 B 2019-05-01 2019-06-06 1
So this basically says for Row 1, person A had 4 open tasks
I tried created list element for every row that includes the range of dates as numeric values and then to check if there is overlap I used %in% operator to compare the unlisted ranges
I've done something similar using the lapply function(not shown here) but same issue it just takes forever to execute.
##This is what I currently have
temp_list <- list()
num_open_tasks <- c()
open_work_cc <- c()
##Create a list of length = nrow(my_df)
##Each element in the list is a range of dates coerced to numeric
for(i in 1:nrow(my_df)){
temp_list[[i]] <- as.numeric(my_df$DATE_START[i]) :
as.numeric(my_df$DATE_FINISH[i])
}
for(i in 1:nrow(my_df)){
for(j in 1:nrow(my_df)){
##If elements from the temp_list overlap by 5 days, the overlap = 5
##I'm just checking if the overlap is greater than 0 (is there any overlap at all)
##And if the tasks belongs to the same person or not
open_work_cc[j] <- ifelse(sum(unlist(temp_list[[i]]) %in%
unlist(temp_list[[j]])) > 0 &
my_df$PERSON[i] == my_df$PERSON[j]
,1,0
)
open_work_cc_total <- sum(open_work_cc)
}
num_open_tasks[i] <- open_work_cc_total
}
my_df <- cbind(my_df, num_open_tasks)
This method returns my desired column populated with the correct values. But I imagine there is a more elegant and significantly faster method using some form of split/apply/combine. Any and all help is appreciated thanks
Upvotes: 2
Views: 703
Reputation: 5898
foverlaps
in data.table
is probably going to be the fastest approach in R. I think the following code does what you want:
library(data.table)
setDT(my_df)
my_df[, DATE_START_N:=as.numeric(as.Date(DATE_START))]
my_df[, DATE_FINISH_N:=as.numeric(as.Date(DATE_FINISH))]
setkey(my_df, PERSON, DATE_START_N,DATE_FINISH_N)
my_df[,NUMBER_OF_TASKS:=foverlaps(my_df,my_df,which=TRUE)[,.N,by=xid]$N]
my_df
For a bit more clarity: foverlaps(my_df,my_df,which=TRUE)
does a self-join on date-range within PERSON
(the join is determined by setkey
). Note that the default argument for type
of interval join for foverlaps is "any"
which is a partial match on intervals: i.e., what you want here.
Specifying which=TRUE
will just provide the indices of matches in x and y (rather than the actual joined data which is unnecessary here). The call to foverlaps
returns an object of class data.table
which is then immediately aggregated to take the number of rows in each group defined by xid
(which are just rows of my_df
) using the bracket function call [,.N,by=xid]
. These counts are extracted into a vector with $N
and assigned to a new column NUMBER_OF_TASKS
in my_df
.
Upvotes: 5