Reputation: 81
I have two large data sets (500k obs each) which I would like to link via fuzzy string matching on individuals' names but utilizing information on other variables as well. The problem is similar to the one described here: How can I match fuzzy match strings from two datasets?
However, the solution posted there requires to produce all pairwise potential matches first via expand.grid
but with my data this cannot be done. Already if you have two data sets of 10,000 obs each, then this results in a total data set of 100,000,000 potential pairwise matches.
I would like to produce a many-to-one merge first wherein observation $k$ from data set A is matched with the 5 closest matching observations from data set B (as judged by the Jaro Winkler string distance), which fall in a certain age band, say plus/minus 5 years.
For instance, if $k$ in data A is
name birthyear
John Smith 1984
and the other observations in data set B are
serial name birthyear
1 John Smith 1983
2 Sara Pinkert 1973
3 John Smyth 1999
4 John Smithe 1985
5 John Smith 1984
6 Jon Smith 1984
then the five "best" matches for $k$ with observations in data B should be obs number 1, 4, 5, 6 for a birth year restriction of +-5 years. In this case no. 2 (Sara Pinkert) should not match due to the name, and no. 3 (John Smyth) should not match because the birth year for this observation is too late.
Functions and commands provided by other libraries such as fastLink
, stringdist
, or recordLinkage
are good and fast but they always only produce one-to-one matches (and they rarely have the capability of incorporating the information from the birth year band to restrict the dimension of the matching problem).
The closes solution I have been able to figure out so far is using the compare.linkage
function from recordLinkage
but the blocking option (blockfld) seems to block strictly on a particular variable so it's not obvious how a range for the birth year information could be used:
rpairs = compare.linkage(dataA,
dataB,
blockfld = c("birthyear"),
identity1 = dataA$id1,
identity2 = dataB$id2,
n_match = 5,
strcmpfun = jarowinkler)
But this only blocks on perfect birthyear, hence it would return two matches which would be obs no. 5 and 6 (John Smith 1984, Jon Smith 1984).
Here is some sample data for the matching problem. Due to the small size it seems trivial but in the whole sample with half a million obs each (some of which appear in one but not in the other data, and some appearing in both but potentially with typos in their names) it's more tricky.
name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)
name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)
Upvotes: 3
Views: 1743
Reputation: 13591
Edited for additional code based on comments
Maybe this will help you
Your data
name1 = c("John Smith", "Adam Bower", "Felix von Epstein", "Charles Sawyer", "Benjamin Hoynes")
yob1 = c(1980, 1977, 1981, 1981, 1978)
dataA = data.frame(name1, yob1)
name2 = c("Jon Smyth", "Perry Bower", "Felix Epstein", "Terry Barnes", "John Smith", "Benamin Hoynes", "Frank Sawyer", "Charles Sawer", "Charles Sauer", "Philip Smith", "Franklin Sawyer", "Jonathan Smith", "Gabriel Bars", "Aron Bow", "Harry Haynes")
yob2 = c(1981, 1983, 1981, 1982, 1983, 1980, 1980, 1986, 1982, 1978, 1977, 1981, 1979, 1975, 1980)
dataB = data.frame(name2, yob2)
Function for approximate string-match and age-band filtering
top_five_amatch <- function(A_row, B) {
require(stringdist)
ans <- intersect(order(stringdist(A_row$name1, dataB$name2, method="jw")), which(abs(A_row$yob1 - dataB$yob2) <= 5))
return(head(ans, 5))
}
At its heart is
library(stringdist)
order(stringdist(dataA$name1[1], dataB$name2, method="jw")) # order of string-distance
# [1] 5 1 12 10 14 7 8 9 6 11 3 2 4 15 13
which(abs(dataA$yob1[1] - dataB$yob2) <= 5) # age band filter
# [1] 1 2 3 4 5 6 7 9 10 11 12 13 14 15
intersect
of the 2 will keep only values that are present after age-band filtering
Main
Grab indexes of nearest match per row of dataA
I <- lapply(seq_len(nrow(dataA)), function(i) top_five_amatch(dataA[i,], dataB))
# [[1]]
# [1] 5 1 12 10 14
# [[2]]
# [1] 14 7 1 4 6
# [[3]]
# [1] 3 1 2 6 11
# [[4]]
# [1] 8 9 7 11 2
# [[5]]
# [1] 6 15 4 2 11
Top 5 matches for each row of dataA
matchB <- dataB[unlist(I), ]
# name2 yob2
# 5 John Smith 1983
# 1 Jon Smyth 1981
# 12 Jonathan Smith 1981
# 10 Philip Smith 1978
# 14 Aron Bow 1975
# 14.1 Aron Bow 1975
# 7 Frank Sawyer 1980
# 1.1 Jon Smyth 1981
# 4 Terry Barnes 1982
# 6 Benamin Hoynes 1980
# 3 Felix Epstein 1981
# 1.2 Jon Smyth 1981
# 2 Perry Bower 1983
# 6.1 Benamin Hoynes 1980
# 11 Franklin Sawyer 1977
# 8 Charles Sawer 1986
# 9 Charles Sauer 1982
# 7.1 Frank Sawyer 1980
# 11.1 Franklin Sawyer 1977
# 2.1 Perry Bower 1983
# 6.2 Benamin Hoynes 1980
# 15 Harry Haynes 1980
# 4.1 Terry Barnes 1982
# 2.2 Perry Bower 1983
# 11.2 Franklin Sawyer 1977
To save in 'wide' format with multiple columns, try something like
matchB <- lapply(I, function(i) dataB[i,])
Reduce("cbind", matchB)
# name2 yob2 name2 yob2 name2 yob2
# 5 John Smith 1983 Frank Sawyer 1980 Felix Epstein 1981
# 1 Jon Smyth 1981 Franklin Sawyer 1977 Benamin Hoynes 1980
# 12 Jonathan Smith 1981 Aron Bow 1975 Perry Bower 1983
# 10 Philip Smith 1978 Benamin Hoynes 1980 Terry Barnes 1982
# 14 Aron Bow 1975 Gabriel Bars 1979 Franklin Sawyer 1977
# name2 yob2 name2 yob2
# 5 Charles Sawer 1986 Benamin Hoynes 1980
# 1 Charles Sauer 1982 Franklin Sawyer 1977
# 12 Franklin Sawyer 1977 Harry Haynes 1980
# 10 Frank Sawyer 1980 Terry Barnes 1982
# 14 Gabriel Bars 1979 Felix Epstein 1981
Upvotes: 4