Varun S.
Varun S.

Reputation: 23

How do I check if each row in a column falls between a range of two other columns?

I have two datatables, each containing > 50k rows and > 150 columns. I would like to find a way to loop through each row in an ID column in DT1, to see if it falls between or is equal to any row of two other columns in DT2.

If it does fall between two others, I'd like to create a column with the corresponding HighID from DT2. If not, return "False".

For simplicity's sake, I've created a sample of my data tables below:

library(data.table)

DT1 <- data.table(
CheckID = c(52702746325, 14179966429, 52127746410, 13151126534, 52274366388, 
21501526375)
)

DT2 <- data.table(
LowID = c(14179966412, 52274366385, 52127746410, 52702746323),
HighID = c(14179966429, 52274366389, 52127746410, 52702746325)
)

I'd like the output of DT1 to look like this:

       CheckID      HighID
1: 52702746325 52702746325
2: 14179966429 14179966429
3: 52127746410 52127746410
4: 13151126534       False
5: 52274366388 52274366389
6: 21501526375       False

Note that I have a function (below) working with dataframes, however with my full dataset, it takes over 6 hours to complete as I have multiple iterations with multiple checks throughout my script.

factor <- as.character(DT2$HighID)
f <- function(x){
  a <- factor[ (DT2$LowID<= x) & (x <= DT2$HighID) ] 
  if (length(a) == 0) FALSE else max(subset(DT2$HighID, DT2$LowID <= x & DT2$HighID>= x))

DT1$HighID <- sapply(DT1$CheckID, f)

I was hoping there would be a more optimized solution using datatables. All advice is much appreciated. Thank you very much.

Upvotes: 2

Views: 100

Answers (1)

mt1022
mt1022

Reputation: 17289

As said in comment, this is a regular non-equi join. One way to do it is:

DT1[DT2, HighID := i.HighID, on = .(CheckID >= LowID, CheckID <= HighID)]
# > DT1
# CheckID      HighID
# 1: 52702746325 52702746325
# 2: 14179966429 14179966429
# 3: 52127746410 52127746410
# 4: 13151126534          NA
# 5: 52274366388 52274366389
# 6: 21501526375          NA

The on = .(CheckID >= LowID, CheckID <= HighID) specifies the join condition and the i.HighID refer to the HighID column in DT2.

The NA rows are where you want to put 'False', but you could not do this for a numeric column unless you turn it to a character column.

Upvotes: 4

Related Questions