Reputation: 1095
I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values which will match each other and some that won't. Some values in both data sets might be duplicated.
I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and get the intersect of index matches for 'firstName' and 'lastName'. I will then use the intersect of the indices to fetch the case ID (cid) from the first data set and append it to the second data set.
Because there might be more than one match per element, I will store them as lists within my data.table. I cannot use base::match
function because it will only return the first match for each element, but I do need the answer to be vectorised in just the same way as the match function.
I've tried different combinations of which(d1$x %in% y)
but this does not work either because it matches for all of y at once instead of one element at a time. I am using data.table because for my real-world use case, the data set to match on could be hundreds of thousands of records, so speed is important.
I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.
Here is some example data:
# Load library
library(data.table)
# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4),
firstName = c("Maria", "Jim", "Jack", "Anne"),
lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
result = c("pos", "neg", "neg", "pos"))
My desired output would look like this:
# Output:
> dt2
lid firstName lastName result fn_match ln_match casematch caseid
1: 1 Maria Antonis pos NA NA NA <NA>
2: 2 Jim Dutto neg 1,4 4 4 c4
3: 3 Jack Blogs neg NA NA NA <NA>
4: 4 Anne Mcfee pos 3,5 3 3 c3
Upvotes: 4
Views: 1231
Reputation: 1095
In my real life scenario, I need to retrieve the indices for matches on more than one column. I found a way to do this in one step by combining some of the other solutions and figured it would be useful to also share this and the explanation of how it works below.
The code below adds a new column caseid
to dt2
, which gets its values from the column cid
in dt1
for the row indices that matched on both firstName and lastName.
Putting dt1 inside the square brackets and specifying on = .(...)
is equivalent to merging dt1 with dt2 on firstName and lastName, but instead of merging all columns from both datasets, one new column called caseid
is created.
The lower case i.
prefix to cid
indicates that cid
is a column from the second data set (dt1
).
The upper case .I
inside the square brackets after i.cid
will retrieve the row indices of dt1 that match dt2 on firstName and lastName.
# Get case IDs from dt1 for matches of firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]
# Output:
> dt2
lid firstName lastName result caseid
1: 1 Maria Antonis pos <NA>
2: 2 Jim Dutto neg c4
3: 3 Jack Blogs neg <NA>
4: 4 Anne Mcfee pos c3
Upvotes: 0
Reputation: 67828
Using .EACHI
and adding the resulting list column by reference.
dt2[ , res := dt1[ , i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
# lid firstName res
# 1: 1 Maria NA
# 2: 2 Jim 1,4
# 3: 3 Jack NA
# 4: 4 Anne 3,5
Upvotes: 3
Reputation: 887951
We could use
library(data.table)
dt1[dt2, .(casematch = toString(cid), lid),on = .(firstName), by = .EACHI]
-output
firstName casematch lid
<char> <char> <num>
1: Maria NA 1
2: Jim c1, c4 2
3: Jack NA 3
4: Anne c3, c5 4
Or with row index
dt1[dt2, .(casematch = na_if(toString(.I), 0), lid),on = .(firstName), by = .EACHI]
firstName casematch lid
<char> <char> <num>
1: Maria <NA> 1
2: Jim 1, 4 2
3: Jack <NA> 3
4: Anne 3, 5 4
Upvotes: 3
Reputation: 41260
A possible solution:
dt1[,id:=seq_along(cid)]
dt1[dt2,.(lid,id,firstName = i.firstName),on=.(firstName)][
,.(casematch =.( id)),by=.(lid,firstName)]
lid firstName casematch
<num> <char> <list>
1: 1 Maria NA
2: 2 Jim 1,4
3: 3 Jack NA
4: 4 Anne 3,5
Upvotes: 4
Reputation: 102880
Another data.table
option
> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
firstName cid lid
1: Maria <NA> 1
2: Jim c1, c4 2
3: Jack <NA> 3
4: Anne c3, c5 4
Upvotes: 1