user98235
user98235

Reputation: 896

Efficiently joining two data tables with a condition

One data table (let's call is A) contains the ID numbers:

ID
3
5
12
8
...

and another table (let's call it B) contains the lower bound and the upper bound and the name for that ID.

ID_lower   ID_upper     Name
 1            4         James
 5            7         Arthur
 8            11        Jacob
 12           13        Sarah

so based on table B, given the ID from table A, we can find the matching name by finding the name on the row in table B such that

ID_lower <= ID <= ID upper

and I wanna create a table of ID and Name, so in the above example, it would be

ID           Name
3            James
5            Arthur
12           Sarah
8            Jacob
...          ...

I used for loop, so that for each row of A, I look for the row in B such that ID is between the ID_lower and ID_upper for that row and joined the name from there. However, this method was a bit slow. Is there a fast way of doing it in R?

Upvotes: 0

Views: 120

Answers (4)

Arun
Arun

Reputation: 118779

Using the new non-equi joins feature in the current development version of data.table, this is straightforward:

require(data.table) # v1.9.7+
dt2[dt1, .(ID, Name), on=.(ID_lower <= ID, ID_upper >= ID)]

See the installation instructions for devel version here.


where,

dt1=fread('ID
            3
            5
           12
            8')
dt2 = fread('ID_lower   ID_upper     Name
                    1         4     James
                    5         7    Arthur
                    8        11     Jacob
                   12        13     Sarah')

Upvotes: 2

bgoldst
bgoldst

Reputation: 35314

I believe findInterval() on ID_lower might be the ideal approach here:

A[,Name:=B[findInterval(ID,ID_lower),Name]];
A;
##    ID   Name
## 1:  3  James
## 2:  5 Arthur
## 3: 12  Sarah
## 4:  8  Jacob

This will only be correct if (1) B is sorted by ID_lower and (2) all values in A$ID are covered by the ranges in B.

Upvotes: 1

akuiper
akuiper

Reputation: 214927

You can try this data.table solution:

data.table::setDT(B)[, .(Name, ID = Map(`:`, ID_lower, ID_upper))]
                    [, .(ID = unlist(ID)), .(Name)][ID %in% A$ID]

     Name ID
1:  James  3
2: Arthur  5
3:  Sarah 12
4:  Jacob  8

Upvotes: 1

HubertL
HubertL

Reputation: 19544

You can make a look-up table with your second data.frame (B):

lu <- do.call(rbind,
              apply(B,1,function(x) 
                    data.frame(ID=c(x[1]:x[2]),Name=x[3], row.names = NULL)))

then you query it with your first data.frame (A):

A$Name <- lu[A$ID,"Name"]

Upvotes: 1

Related Questions