Reputation: 23
I have two datatables, each containing > 50k rows and > 150 columns. I would like to find a way to loop through each row in an ID column in DT1, to see if it falls between or is equal to any row of two other columns in DT2.
If it does fall between two others, I'd like to create a column with the corresponding HighID from DT2. If not, return "False".
For simplicity's sake, I've created a sample of my data tables below:
library(data.table)
DT1 <- data.table(
CheckID = c(52702746325, 14179966429, 52127746410, 13151126534, 52274366388,
21501526375)
)
DT2 <- data.table(
LowID = c(14179966412, 52274366385, 52127746410, 52702746323),
HighID = c(14179966429, 52274366389, 52127746410, 52702746325)
)
I'd like the output of DT1 to look like this:
CheckID HighID
1: 52702746325 52702746325
2: 14179966429 14179966429
3: 52127746410 52127746410
4: 13151126534 False
5: 52274366388 52274366389
6: 21501526375 False
Note that I have a function (below) working with dataframes, however with my full dataset, it takes over 6 hours to complete as I have multiple iterations with multiple checks throughout my script.
factor <- as.character(DT2$HighID)
f <- function(x){
a <- factor[ (DT2$LowID<= x) & (x <= DT2$HighID) ]
if (length(a) == 0) FALSE else max(subset(DT2$HighID, DT2$LowID <= x & DT2$HighID>= x))
DT1$HighID <- sapply(DT1$CheckID, f)
I was hoping there would be a more optimized solution using datatables. All advice is much appreciated. Thank you very much.
Upvotes: 2
Views: 100
Reputation: 17289
As said in comment, this is a regular non-equi join. One way to do it is:
DT1[DT2, HighID := i.HighID, on = .(CheckID >= LowID, CheckID <= HighID)]
# > DT1
# CheckID HighID
# 1: 52702746325 52702746325
# 2: 14179966429 14179966429
# 3: 52127746410 52127746410
# 4: 13151126534 NA
# 5: 52274366388 52274366389
# 6: 21501526375 NA
The on = .(CheckID >= LowID, CheckID <= HighID)
specifies the join condition and the i.HighID
refer to the HighID
column in DT2
.
The NA rows are where you want to put 'False', but you could not do this for a numeric column unless you turn it to a character column.
Upvotes: 4