Reputation: 61
I have two large datasets that I want to match/merge. The problem is that there are no exact matches, so I am left with fuzzy matching on their names (which can vary from a single character to a three-word name).
One of the sets contains approximately 220k observations that I need to match against a "solution" dataset containing 1.2m observations. I expect that, at best, around half (~100k) of the observations can be matched, but I would be happy with 50k.
However, I haven't found a good way of doing this: I get "cannot allocate vector" errors when using the stringdist_join function from the fuzzyjoin package, and with similar approaches.
Any advice? Thanks!
Sincerely,
Upvotes: 0
Views: 455
Reputation: 39647
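A simple example that calls adist on one element of s1 at a time, so that only a single vector of distances is held in memory rather than the full distance matrix.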
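s1 <- c("abc", "abd")
s2 <- c("xy", "ad", "bc")
# Start with the distances from the first element of s1 to every element of s2.
di <- c(adist(s1[1], s2))
idx <- rep(1, length(s2))
# Compare the remaining elements of s1 one at a time, keeping the best so far.
# seq_along(s1)[-1] is empty when s1 has one element, unlike 2:length(s1).
for(i in seq_along(s1)[-1]) {
  tt <- c(adist(s1[i], s2))
  j <- which(tt < di)  # positions in s2 where s1[i] is a closer match
  di[j] <- tt[j]
  idx[j] <- i
}
data.frame(s2, bestMatch=s1[idx], distance = di)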
#  s2 bestMatch distance
#1 xy       abc        3
#2 ad       abd        1
#3 bc       abc        1
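The loop above can be wrapped into a reusable function. The following is only a sketch under assumptions: the name best_match and its max_dist cutoff are hypothetical additions, not part of any package, and the cutoff is just one possible way to discard poor matches. It uses base R's adist only. Note that memory stays bounded by the length of queries, but the number of comparisons is still length(queries) times length(candidates), so on datasets of the size in the question it may run for a very long time.

```r
# Hypothetical helper (an illustration, not a library function): for every
# element of `queries`, find the closest element of `candidates` by edit
# distance, holding only one vector of distances in memory at a time.
best_match <- function(queries, candidates, max_dist = Inf) {
  stopifnot(length(queries) > 0, length(candidates) > 0)
  best_dist <- rep(Inf, length(queries))
  best_idx  <- rep(NA_integer_, length(queries))
  for (i in seq_along(candidates)) {
    d <- c(adist(candidates[i], queries))  # distances from one candidate to all queries
    better <- which(d < best_dist)         # queries for which this candidate is closer
    best_dist[better] <- d[better]
    best_idx[better]  <- i
  }
  out <- data.frame(query = queries,
                    bestMatch = candidates[best_idx],
                    distance = best_dist,
                    stringsAsFactors = FALSE)
  # Assumed cutoff: keep only matches within max_dist edits.
  out[out$distance <= max_dist, , drop = FALSE]
}

res <- best_match(c("xy", "ad", "bc"), c("abc", "abd"), max_dist = 1)
print(res)
```

With max_dist = 1, the "xy" row from the example is dropped because its best match is three edits away. A sensible cutoff depends on the data and would need to be tuned, for instance by inspecting a sample of matches at each distance.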
Upvotes: 1