Repeated subsetting can you make this process faster

Question

I need to loop through data table2 DT2 and for each row check if there is a match on some conditions in data table1 DT1.

I currently do this like this. This looks in DT1 for the data I have in DT2. In the loop you will see:

  subset = DT1[DT1$time == DT2$time[i] & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )  & DT1$Flag ==0]

If there is a match I need to

1 - flag the row that matched in DT1 so I do not match on it again. In the code that is done with DT1$Flag[match$RowNumber] = 1

2 - populate a column in DT2 with data from the corresponding column in DT1 In the code that is done with

  DT2$x[i]  = match$x

That's pretty much it and this works BUT DT1 might be 10,000 rows and DT2 might be 100,000 rows so for each 10,000 rows I am subsetting. Subsetting 10,000 times!

There is also a second requirement which is the loop has a match condition of the matching EXACTLY on time. In the code you see

DT1$time == DT2$time[i]

After that matching is done. that time condition is relaxed and a second loop is run that has a time window of +/- 7 seconds so in the second loop you may find a match in DT1 that is +/- 7 seconds away from the time in DT2.

DT1[  DT1$time >= DT2$time[i] -7 & DT1$time <= DT2$time[i]+7

This works as well but again because the data tables have some many rows the subsetting takes a very long time. Can these 2 loops be made faster somehow?

Notice the OR condition in the subset which makes this tricky.

 (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )

Sample code is below which you can run. Thank you.

############# Here is the setup of the datatables
times= rep(as.POSIXct("2016-01-01",tz="GMT")+seq(1,10,by = 1),2)
times= times[order(times)]
DT1 = data.table(time = times, a = c(1,seq(1,19,1)) , b = c(11,seq(11,29,1)) , cat= c("a","a",  rep(c("a","b"),each=9) ) ,Flag =rep(0,20)  ,x = seq(201,220,1) )
DT1$RowNumber = seq(1,dim(DT1)[1],1)
DT2 = data.table(time = as.POSIXct(c("2016-01-01 00:00:01","2016-01-01 00:00:10","2016-01-01 00:00:10"),tz = "GMT"), a = c(1,19,10),b=c(11,29,20), cat1 = c("a","x","b"), x = c(0,0,0),MatchType = c("none","none","none"), cat2=c("a","b","a"))


 ######### This is the for loop that does the matching
#If there is a match i.e. dim(subset)[1]>0 two things happen
# 2 - flag the row used in DT1 so it is not used again...notice DT1$Flag ==0 is used in the subset
# 1 - populate column x in dt2 from column x in dt1
for(i in 1:dim(DT2)[1])#loop over rows of dt2
{
  #i =1
  subset = DT1[DT1$time == DT2$time[i] & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )  & DT1$Flag ==0] #lookin dt2 for the dt1 data
  if(dim(subset)[1]>0)
  {
    match = head(subset,1) # if there are multiple matches only use the 1st one
    DT1$Flag[match$RowNumber] = 1 #flag the row used in DT1 so it is not used again
    DT2$MatchType[i]  = "First Loop"#populate column x in dt2 from column x in dt1
    DT2$x[i]  = match$x #populate column x in dt2 from column x in dt1
  }
}
##### after that loop some rows in DT2 will not have a match. In this case the last row has HasAMatch = 0
DT2 # NOTE HERE that  the last row has Match Type = none because a match could not be found
DT1 # NOTE the flag column has a 1 in the first and last rows which was set in the loop when the match occured

##### Now a second loop is done this time trying to match within a time window +-7 seconds instead of a matching EXACTLEY on time
firstloop = DT2[DT2$MatchType != "none",] ### this removes any of the rows ALREADY MATCHED IN THE FIRST LOOP
DT2 =DT2[DT2$MatchType == "none",] ### this is used in the loop below and has the rows that have NOT been matched yet
DT1 = DT1[DT1$Flag == 0,] ## this again removes rows from DT1 that have already been matched
DT1$RowNumber = seq(1,dim(DT1)[1],1)
for(i in 1:dim(DT2)[1])#loop over rows of dt2
{
  i=1
  subset = DT1[  DT1$time >= DT2$time[i] -7 & DT1$time <= DT2$time[i]+7  & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )  & DT1$Flag ==0] #lookin dt2 for the dt1 data
  if(dim(subset)[1]>0)
  {
    match = head(subset,1) # if there are multiple matches only use the 1st one
    DT1$Flag[match$RowNumber] = 1 #flag the row used in DT1 so it is not used again
    DT2$MatchType[i]  = "Second Loop" #populate column x in dt2 from column x in dt1
    DT2$x[i]  = match$x #populate column x in dt2 from column x in dt1
  }
}

# now the process is finished
rbind(firstloop, DT2) # NOTE now you can see the match type of "second loop" for the last row
DT1 # NOTE the flag in row 10 because that was the row used in the match

Jaap · Accepted Answer

First I should mention that you have to try to prevent from using function names af variable names in your datasets: cat and subset are a functions in R, therefore I used cat0 instead of cat and subs instead of subset in this answer. There are several possible improvements to your code:

Creation of the sample data:

Especially create the RowNumber variable can be done more efficient with .I. Furthermore, I've also given DT2 a rownumber variable as this is usefull in the next steps:

times <- rep(as.POSIXct("2016-01-01",tz="GMT") + seq(1,10,by = 1), 2)
times <- times[order(times)]

DT1 <- data.table(time = times, 
                  a = c(1,1:19), 
                  b = c(11,11:29), 
                  cat0 = c("a","a", rep(c("a","b"), each=9)),
                  Flag = rep(0,20),
                  x = seq(201,220,1))[, rn := .I]

DT2 <- data.table(time = as.POSIXct(c("2016-01-01 00:00:01","2016-01-01 00:00:10","2016-01-01 00:00:10"), tz="GMT"), 
                  a = c(1,19,10),
                  b = c(11,29,20), 
                  cat1 = c("a","x","b"), 
                  x = c(0,0,0),
                  MatchType = c("none","none","none"), 
                  cat2 = c("a","b","a"))[, rn := .I]

The first loop:

This can be simplified by taking out the updating of DT1 and making use of the update by reference possibility of the data.table package with := (which will probably be much more efficient than doing it inside the for-loop):

for(i in 1:nrow(DT2))
{
  subs <- DT1[time == DT2$time[i] & 
                a == DT2$a[i] & 
                b == DT2$b[i] & 
                (cat0 == DT2$cat1[i] | cat0 == DT2$cat2[i])
              & Flag == 0] 
  if(nrow(subs) > 0)
  {
    DT2[i, `:=` (MatchType = 'First Loop', x = subs$x[1])]
  }
}

DT1 can then be updated by creating an index and again using := to update by reference:

idx1 <- DT1[(time %in% DT2$time) & (a %in% DT2$a) & (b %in% DT2$b) & 
              (cat0 %in% DT2$cat1 | cat0 %in% DT2$cat2) & (Flag == 0), 
            .SD[1], 
            .(time,a,b,cat0,Flag)]$rn
DT1[idx1, Flag := 1]

The second loop:

for(i in DT2[MatchType == "none"]$rn)   # here we need the rownumber variable for DT2
{
  subs <- DT1[time >= DT2$time[i]-7 & 
                time <= DT2$time[i]+7 & 
                a == DT2$a[i] & 
                b == DT2$b[i] & 
                (cat0 == DT2$cat1[i] | cat0 == DT2$cat2[i] )  
              & Flag == 0]
  if(nrow(subs) > 0)
  {
    DT1[subs$rn[1], Flag := 2]
    DT2[i, `:=` (MatchType = 'Second Loop', x = subs$x[1])]
  }
}

These improvements remove the need for creating the intermediate subset and rbind steps. The final result:

> DT2
                  time  a  b cat1   x   MatchType cat2 rn
1: 2016-01-01 00:00:01  1 11    a 201  First Loop    a  1
2: 2016-01-01 00:00:10 19 29    x 220  First Loop    b  2
3: 2016-01-01 00:00:10 10 20    b 211 Second Loop    a  3

> DT1
                   time  a  b cat0 Flag   x rn
 1: 2016-01-01 00:00:01  1 11    a    1 201  1
 2: 2016-01-01 00:00:01  1 11    a    0 202  2
 3: 2016-01-01 00:00:02  2 12    a    0 203  3
 4: 2016-01-01 00:00:02  3 13    a    0 204  4
 5: 2016-01-01 00:00:03  4 14    a    0 205  5
 6: 2016-01-01 00:00:03  5 15    a    0 206  6
 7: 2016-01-01 00:00:04  6 16    a    0 207  7
 8: 2016-01-01 00:00:04  7 17    a    0 208  8
 9: 2016-01-01 00:00:05  8 18    a    0 209  9
10: 2016-01-01 00:00:05  9 19    a    0 210 10
11: 2016-01-01 00:00:06 10 20    a    2 211 11
12: 2016-01-01 00:00:06 11 21    b    0 212 12
13: 2016-01-01 00:00:07 12 22    b    0 213 13
14: 2016-01-01 00:00:07 13 23    b    0 214 14
15: 2016-01-01 00:00:08 14 24    b    0 215 15
16: 2016-01-01 00:00:08 15 25    b    0 216 16
17: 2016-01-01 00:00:09 16 26    b    0 217 17
18: 2016-01-01 00:00:09 17 27    b    0 218 18
19: 2016-01-01 00:00:10 18 28    b    0 219 19
20: 2016-01-01 00:00:10 19 29    b    1 220 20

Repeated subsetting can you make this process faster

Answers (1)

Related Questions