Ankhnesmerira
Ankhnesmerira

Reputation: 1430

data.table merge on partial match of different columns in R

this question might have been asked previously but I'm looking for a data.table solution if possible without using other packages. Ive got a data.table DT1 as a reference:

> require(data.table)
> DT1 <- data.table(col1 = c("AA", "BA", "ABC", "ABC BC", "AB")
                  , col2 = c(1,4,5,3,2))
> DT1
     col1 col2
1:     AA    1
2:     BA    4
3:    ABC    5
4: ABC BC    3
5:     AB    2

and I would like to merge a second data.table DT2 with DT1 based on partial matching of col1 in DT1 and col2 in DT2, creating a col3 in DT2.

> DT2 <- data.table(col1 = c(0,5,2,7,1,0)
                  , col2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
> DT2
   col1 col2
1:    0   BA
2:    5  ABC
3:    2   DC
4:    7   AA
5:    1   AB
6:    0  R AB

desired output

 > desired_output <- data.table(col1 = c(0,5,5,2,7,1,1,1,0)
                                 , col2 = c("BA", "ABC", "ABC", "DC", "AA",  "AB", "AB", "AB", "R AB")
                                 , col3 = c(4,5,3,NA,1,5,3,2,2))
> desired_output
   col1 col2 col3
1:    0   BA    4
2:    5  ABC    5
3:    5  ABC    3
4:    2   DC   NA
5:    7   AA    1
6:    1   AB    5
7:    1   AB    3
8:    1   AB    2
9:    0  R AB   2

Is there any graceful way to do this using data.table operations? if not then happy to consider other solutions. this is going to be run on a very large dataset.


Edit: to specify the conditions of partial matching, it's a match if either string of col1 in DT1 is a subset of the string of col2 in DT2 or vice versa (the string of col2 in DT2 is a subset of the string of col1 in DT1). A two way grepl?

col1/DT1    col2/DT2
  "AB"       "There is ABhere"    # it's a match
  "ABC"      "someABC"            # it's a match
  "ABC BC"   "ABC"                # it's a reverse match
  "DR"       "ADD"                # no match
  "BA"       "HABAHA"             # two matches

Upvotes: 1

Views: 1578

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

Given the dimension of the problem (DT1 [(1:50,000), (1:25)] - DT2[(1:50,000,000), (1:55)]), it is probably infeasible to do a CJ of the IDs before doing a two-way grepl.

Breaking down the different kind of matches/approx. matches, we can 1) first look for exact matches, 2) then approx. matches where substring in DT1 can be found in DT2 and then, 3) vice versa.

Finally, we row bind all the results and do a left join between original DT2 and the row-binded results to get desired output.

exactMatches <- DT1[DT2, on=c("ID1"="ID2"), nomatch=0L][,
    ID2 := ID1]

substr1in2 <- DT2[, c(.SD, DT1[grepl(ID2, ID1) & ID1 != ID2]), 
    by=1:DT2[,.N]][!is.na(VAL1), -1L]

substr2in1 <- DT1[, c(.SD, DT2[grepl(ID1, ID2) & ID2 != ID1]), 
    by=1:DT1[,.N]][!is.na(VAL2), -1L]

binded <- rbindlist(list(exactMatches, substr1in2, substr2in1), 
    use.names=TRUE, fill=TRUE)

binded[DT2, on=.(ID2, VAL2)]

output:

       ID1 VAL1 VAL2  ID2
 1:     BA    4    0   BA
 2:    ABC    5    5  ABC
 3: ABC BC    3    5  ABC
 4:     AB    2    5  ABC
 5:   <NA>   NA    2   DC
 6:     AA    1    7   AA
 7:     AB    2    1   AB
 8:    ABC    5    1   AB
 9: ABC BC    3    1   AB
10:     AB    2    0 R AB

I changed some of the column names to make the code more readable. Data:

DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"), 
    VAL1 = c(1,4,5,3,2))

DT2 <- data.table(VAL2 = c(0,5,2,7,1,0),
    ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))

Upvotes: 2

Related Questions