Reputation: 9018
I need to loop through data table2 DT2 and for each row check if there is a match on some conditions in data table1 DT1.
I currently do this like this. This looks in DT1 for the data I have in DT2. In the loop you will see:
subset = DT1[DT1$time == DT2$time[i] & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] ) & DT1$Flag ==0]
If there is a match I need to
1 - flag the row that matched in DT1 so I do not match on it again. In the code that is done with DT1$Flag[match$RowNumber] = 1
2 - populate a column in DT2 with data from the corresponding column in DT1 In the code that is done with
DT2$x[i] = match$x
That's pretty much it and this works BUT DT1 might be 10,000 rows and DT2 might be 100,000 rows so for each 10,000 rows I am subsetting. Subsetting 10,000 times!
There is also a second requirement which is the loop has a match condition of the matching EXACTLY on time. In the code you see
DT1$time == DT2$time[i]
After that matching is done. that time condition is relaxed and a second loop is run that has a time window of +/- 7 seconds so in the second loop you may find a match in DT1 that is +/- 7 seconds away from the time in DT2.
DT1[ DT1$time >= DT2$time[i] -7 & DT1$time <= DT2$time[i]+7
This works as well but again because the data tables have some many rows the subsetting takes a very long time. Can these 2 loops be made faster somehow?
Notice the OR condition in the subset which makes this tricky.
(DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )
Sample code is below which you can run. Thank you.
############# Here is the setup of the datatables
times= rep(as.POSIXct("2016-01-01",tz="GMT")+seq(1,10,by = 1),2)
times= times[order(times)]
DT1 = data.table(time = times, a = c(1,seq(1,19,1)) , b = c(11,seq(11,29,1)) , cat= c("a","a", rep(c("a","b"),each=9) ) ,Flag =rep(0,20) ,x = seq(201,220,1) )
DT1$RowNumber = seq(1,dim(DT1)[1],1)
DT2 = data.table(time = as.POSIXct(c("2016-01-01 00:00:01","2016-01-01 00:00:10","2016-01-01 00:00:10"),tz = "GMT"), a = c(1,19,10),b=c(11,29,20), cat1 = c("a","x","b"), x = c(0,0,0),MatchType = c("none","none","none"), cat2=c("a","b","a"))
######### This is the for loop that does the matching
#If there is a match i.e. dim(subset)[1]>0 two things happen
# 2 - flag the row used in DT1 so it is not used again...notice DT1$Flag ==0 is used in the subset
# 1 - populate column x in dt2 from column x in dt1
for(i in 1:dim(DT2)[1])#loop over rows of dt2
{
#i =1
subset = DT1[DT1$time == DT2$time[i] & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] ) & DT1$Flag ==0] #lookin dt2 for the dt1 data
if(dim(subset)[1]>0)
{
match = head(subset,1) # if there are multiple matches only use the 1st one
DT1$Flag[match$RowNumber] = 1 #flag the row used in DT1 so it is not used again
DT2$MatchType[i] = "First Loop"#populate column x in dt2 from column x in dt1
DT2$x[i] = match$x #populate column x in dt2 from column x in dt1
}
}
##### after that loop some rows in DT2 will not have a match. In this case the last row has HasAMatch = 0
DT2 # NOTE HERE that the last row has Match Type = none because a match could not be found
DT1 # NOTE the flag column has a 1 in the first and last rows which was set in the loop when the match occured
##### Now a second loop is done this time trying to match within a time window +-7 seconds instead of a matching EXACTLEY on time
firstloop = DT2[DT2$MatchType != "none",] ### this removes any of the rows ALREADY MATCHED IN THE FIRST LOOP
DT2 =DT2[DT2$MatchType == "none",] ### this is used in the loop below and has the rows that have NOT been matched yet
DT1 = DT1[DT1$Flag == 0,] ## this again removes rows from DT1 that have already been matched
DT1$RowNumber = seq(1,dim(DT1)[1],1)
for(i in 1:dim(DT2)[1])#loop over rows of dt2
{
i=1
subset = DT1[ DT1$time >= DT2$time[i] -7 & DT1$time <= DT2$time[i]+7 & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] ) & DT1$Flag ==0] #lookin dt2 for the dt1 data
if(dim(subset)[1]>0)
{
match = head(subset,1) # if there are multiple matches only use the 1st one
DT1$Flag[match$RowNumber] = 1 #flag the row used in DT1 so it is not used again
DT2$MatchType[i] = "Second Loop" #populate column x in dt2 from column x in dt1
DT2$x[i] = match$x #populate column x in dt2 from column x in dt1
}
}
# now the process is finished
rbind(firstloop, DT2) # NOTE now you can see the match type of "second loop" for the last row
DT1 # NOTE the flag in row 10 because that was the row used in the match
Upvotes: 2
Views: 104
Reputation: 83215
First I should mention that you have to try to prevent from using function names af variable names in your datasets: cat
and subset
are a functions in R, therefore I used cat0
instead of cat
and subs
instead of subset
in this answer. There are several possible improvements to your code:
Creation of the sample data:
Especially create the RowNumber
variable can be done more efficient with .I
. Furthermore, I've also given DT2
a rownumber variable as this is usefull in the next steps:
times <- rep(as.POSIXct("2016-01-01",tz="GMT") + seq(1,10,by = 1), 2)
times <- times[order(times)]
DT1 <- data.table(time = times,
a = c(1,1:19),
b = c(11,11:29),
cat0 = c("a","a", rep(c("a","b"), each=9)),
Flag = rep(0,20),
x = seq(201,220,1))[, rn := .I]
DT2 <- data.table(time = as.POSIXct(c("2016-01-01 00:00:01","2016-01-01 00:00:10","2016-01-01 00:00:10"), tz="GMT"),
a = c(1,19,10),
b = c(11,29,20),
cat1 = c("a","x","b"),
x = c(0,0,0),
MatchType = c("none","none","none"),
cat2 = c("a","b","a"))[, rn := .I]
The first loop:
This can be simplified by taking out the updating of DT1
and making use of the update by reference possibility of the data.table package with :=
(which will probably be much more efficient than doing it inside the for-loop):
for(i in 1:nrow(DT2))
{
subs <- DT1[time == DT2$time[i] &
a == DT2$a[i] &
b == DT2$b[i] &
(cat0 == DT2$cat1[i] | cat0 == DT2$cat2[i])
& Flag == 0]
if(nrow(subs) > 0)
{
DT2[i, `:=` (MatchType = 'First Loop', x = subs$x[1])]
}
}
DT1
can then be updated by creating an index and again using :=
to update by reference:
idx1 <- DT1[(time %in% DT2$time) & (a %in% DT2$a) & (b %in% DT2$b) &
(cat0 %in% DT2$cat1 | cat0 %in% DT2$cat2) & (Flag == 0),
.SD[1],
.(time,a,b,cat0,Flag)]$rn
DT1[idx1, Flag := 1]
The second loop:
for(i in DT2[MatchType == "none"]$rn) # here we need the rownumber variable for DT2
{
subs <- DT1[time >= DT2$time[i]-7 &
time <= DT2$time[i]+7 &
a == DT2$a[i] &
b == DT2$b[i] &
(cat0 == DT2$cat1[i] | cat0 == DT2$cat2[i] )
& Flag == 0]
if(nrow(subs) > 0)
{
DT1[subs$rn[1], Flag := 2]
DT2[i, `:=` (MatchType = 'Second Loop', x = subs$x[1])]
}
}
These improvements remove the need for creating the intermediate subset and rbind
steps. The final result:
> DT2
time a b cat1 x MatchType cat2 rn
1: 2016-01-01 00:00:01 1 11 a 201 First Loop a 1
2: 2016-01-01 00:00:10 19 29 x 220 First Loop b 2
3: 2016-01-01 00:00:10 10 20 b 211 Second Loop a 3
> DT1
time a b cat0 Flag x rn
1: 2016-01-01 00:00:01 1 11 a 1 201 1
2: 2016-01-01 00:00:01 1 11 a 0 202 2
3: 2016-01-01 00:00:02 2 12 a 0 203 3
4: 2016-01-01 00:00:02 3 13 a 0 204 4
5: 2016-01-01 00:00:03 4 14 a 0 205 5
6: 2016-01-01 00:00:03 5 15 a 0 206 6
7: 2016-01-01 00:00:04 6 16 a 0 207 7
8: 2016-01-01 00:00:04 7 17 a 0 208 8
9: 2016-01-01 00:00:05 8 18 a 0 209 9
10: 2016-01-01 00:00:05 9 19 a 0 210 10
11: 2016-01-01 00:00:06 10 20 a 2 211 11
12: 2016-01-01 00:00:06 11 21 b 0 212 12
13: 2016-01-01 00:00:07 12 22 b 0 213 13
14: 2016-01-01 00:00:07 13 23 b 0 214 14
15: 2016-01-01 00:00:08 14 24 b 0 215 15
16: 2016-01-01 00:00:08 15 25 b 0 216 16
17: 2016-01-01 00:00:09 16 26 b 0 217 17
18: 2016-01-01 00:00:09 17 27 b 0 218 18
19: 2016-01-01 00:00:10 18 28 b 0 219 19
20: 2016-01-01 00:00:10 19 29 b 1 220 20
Upvotes: 1