Andy
Andy

Reputation: 81

Matching two data sets via fuzzy many-to-one string match in R

I have two large data sets (500k obs each) which I would like to link via fuzzy string matching on individuals' names but utilizing information on other variables as well. The problem is similar to the one described here: How can I match fuzzy match strings from two datasets?

However, the solution posted there requires to produce all pairwise potential matches first via expand.grid but with my data this cannot be done. Already if you have two data sets of 10,000 obs each, then this results in a total data set of 100,000,000 potential pairwise matches.

I would like to produce a many-to-one merge first wherein observation $k$ from data set A is matched with the 5 closest matching observations from data set B (as judged by the Jaro Winkler string distance), which fall in a certain age band, say plus/minus 5 years.

For instance, if $k$ in data A is

name          birthyear
John Smith    1984

and the other observations in data set B are

serial   name           birthyear
1        John Smith     1983
2        Sara Pinkert   1973
3        John Smyth     1999
4        John Smithe    1985
5        John Smith     1984
6        Jon Smith      1984

then the five "best" matches for $k$ with observations in data B should be obs number 1, 4, 5, 6 for a birth year restriction of +-5 years. In this case no. 2 (Sara Pinkert) should not match due to the name, and no. 3 (John Smyth) should not match because the birth year for this observation is too late.

Functions and commands provided by other libraries such as fastLink, stringdist, or recordLinkage are good and fast but they always only produce one-to-one matches (and they rarely have the capability of incorporating the information from the birth year band to restrict the dimension of the matching problem).

The closes solution I have been able to figure out so far is using the compare.linkage function from recordLinkage but the blocking option (blockfld) seems to block strictly on a particular variable so it's not obvious how a range for the birth year information could be used:

rpairs = compare.linkage(dataA, 
                     dataB, 
                     blockfld = c("birthyear"), 
                     identity1 = dataA$id1, 
                     identity2 = dataB$id2, 
                     n_match = 5, 
                     strcmpfun = jarowinkler)

But this only blocks on perfect birthyear, hence it would return two matches which would be obs no. 5 and 6 (John Smith 1984, Jon Smith 1984).

Here is some sample data for the matching problem. Due to the small size it seems trivial but in the whole sample with half a million obs each (some of which appear in one but not in the other data, and some appearing in both but potentially with typos in their names) it's more tricky.

name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)

name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)

Upvotes: 3

Views: 1743

Answers (1)

CPak
CPak

Reputation: 13591

Edited for additional code based on comments

Maybe this will help you

Your data

name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)

name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)

Function for approximate string-match and age-band filtering

top_five_amatch <- function(A_row, B) {
                require(stringdist)
                ans <- intersect(order(stringdist(A_row$name1, dataB$name2, method="jw")), which(abs(A_row$yob1 - dataB$yob2) <= 5))
                return(head(ans, 5))
            }

At its heart is

library(stringdist)
order(stringdist(dataA$name1[1], dataB$name2, method="jw"))     # order of string-distance
# [1]  5  1 12 10 14  7  8  9  6 11  3  2  4 15 13

which(abs(dataA$yob1[1] - dataB$yob2) <= 5)                     # age band filter
# [1]  1  2  3  4  5  6  7  9 10 11 12 13 14 15 

intersect of the 2 will keep only values that are present after age-band filtering


Main
Grab indexes of nearest match per row of dataA

I <- lapply(seq_len(nrow(dataA)), function(i) top_five_amatch(dataA[i,], dataB))
# [[1]]
# [1]  5  1 12 10 14

# [[2]]
# [1] 14  7  1  4  6

# [[3]]
# [1]  3  1  2  6 11

# [[4]]
# [1]  8  9  7 11  2

# [[5]]
# [1]  6 15  4  2 11

Top 5 matches for each row of dataA

matchB <- dataB[unlist(I), ]
               # name2 yob2
# 5         John Smith 1983
# 1          Jon Smyth 1981
# 12    Jonathan Smith 1981
# 10      Philip Smith 1978
# 14          Aron Bow 1975
# 14.1        Aron Bow 1975
# 7       Frank Sawyer 1980
# 1.1        Jon Smyth 1981
# 4       Terry Barnes 1982
# 6     Benamin Hoynes 1980
# 3      Felix Epstein 1981
# 1.2        Jon Smyth 1981
# 2        Perry Bower 1983
# 6.1   Benamin Hoynes 1980
# 11   Franklin Sawyer 1977
# 8      Charles Sawer 1986
# 9      Charles Sauer 1982
# 7.1     Frank Sawyer 1980
# 11.1 Franklin Sawyer 1977
# 2.1      Perry Bower 1983
# 6.2   Benamin Hoynes 1980
# 15      Harry Haynes 1980
# 4.1     Terry Barnes 1982
# 2.2      Perry Bower 1983
# 11.2 Franklin Sawyer 1977

To save in 'wide' format with multiple columns, try something like

matchB <- lapply(I, function(i) dataB[i,])
Reduce("cbind", matchB)
            # name2 yob2           name2 yob2           name2 yob2
# 5      John Smith 1983    Frank Sawyer 1980   Felix Epstein 1981
# 1       Jon Smyth 1981 Franklin Sawyer 1977  Benamin Hoynes 1980
# 12 Jonathan Smith 1981        Aron Bow 1975     Perry Bower 1983
# 10   Philip Smith 1978  Benamin Hoynes 1980    Terry Barnes 1982
# 14       Aron Bow 1975    Gabriel Bars 1979 Franklin Sawyer 1977
             # name2 yob2           name2 yob2
# 5    Charles Sawer 1986  Benamin Hoynes 1980
# 1    Charles Sauer 1982 Franklin Sawyer 1977
# 12 Franklin Sawyer 1977    Harry Haynes 1980
# 10    Frank Sawyer 1980    Terry Barnes 1982
# 14    Gabriel Bars 1979   Felix Epstein 1981

Upvotes: 4

Related Questions