TerribleStudent

Reputation: 61

Match two large datasets in R using fuzzy matching

I have two large datasets that I want to match/merge. The problem is that there are no exact matches, so I am left with fuzzy matching on their names (which can vary from a single character to a three-word name, etc.).

One of the sets contains approx. 220k observations that I need to match against a "solution" dataset that contains 1.2m observations. I expect that, at best, around half (~100k) of the observations can be matched, but I would be happy with 50k.

However, I haven't found a good way of doing this, since I get "cannot allocate vector" errors when using, for example, the stringdist_join function from the fuzzyjoin package.

Any advice? Thanks!

Sincerely,

Upvotes: 0

Views: 455

Answers (1)

GKi

Reputation: 39647

Simple example that calls adist on one element of s1 at a time to save memory: only vectors of length(s2) are held at once, instead of the full distance matrix.

s1 <- c("abc", "abd")
s2 <- c("xy", "ad", "bc")

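# Start with the distances from the first element of s1 to every element of s2
di <- c(adist(s1[1], s2))
idx <- rep(1, length(s2))
# Go through the remaining elements of s1, keeping the best match seen so far
# (seq_along(s1)[-1] is empty when s1 has a single element, unlike 2:length(s1))
for(i in seq_along(s1)[-1]) {
  tt <- c(adist(s1[i], s2))
  j <- which(tt < di)  # positions in s2 where s1[i] is strictly closer
  di[j] <- tt[j]
  idx[j] <- i
}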
data.frame(s2, bestMatch=s1[idx], distance = di)
#  s2 bestMatch distance
#1 xy       abc        3
#2 ad       abd        1
#3 bc       abc        1
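
For the sizes in the question, a full distance matrix would have 220,000 x 1,200,000 = 2.64e11 entries, which is why it cannot be allocated. The same idea as above can be sketched with blocks instead of single elements, which trades a bit more memory for fewer adist calls. This is only a rough sketch: the function name best_match and the arguments block_size and max_dist are made up for illustration, table is assumed to be non-empty and free of NA, and block_size has to be chosen so that a block_size x length(table) matrix fits in memory.

```r
# Hypothetical helper (not from the original answer): for each element of `x`,
# find the closest element of `table` by edit distance. `x` is processed in
# blocks, so only a block_size x length(table) distance matrix exists at once.
best_match <- function(x, table, block_size = 100L, max_dist = Inf) {
  n <- length(x)
  best_idx <- integer(n)
  best_dist <- numeric(n)
  # Split the row indices 1..n into consecutive blocks of at most block_size
  blocks <- split(seq_len(n), ceiling(seq_len(n) / block_size))
  for (rows in blocks) {
    d <- adist(x[rows], table)       # length(rows) x length(table) matrix
    k <- apply(d, 1L, which.min)     # first column holding each row minimum
    best_idx[rows] <- k
    best_dist[rows] <- d[cbind(seq_along(rows), k)]
  }
  # Matches further away than max_dist are reported as NA
  keep <- best_dist <= max_dist
  data.frame(x = x,
             bestMatch = ifelse(keep, table[best_idx], NA_character_),
             distance = ifelse(keep, best_dist, NA_real_),
             stringsAsFactors = FALSE)
}

s1 <- c("abc", "abd")
s2 <- c("xy", "ad", "bc")
res <- best_match(s2, s1, block_size = 2L)
print(res)
```

On the toy data this gives the same best matches and distances as the loop above. Setting max_dist (for example max_dist = 2) drops poor matches, which seems reasonable when only about half of the observations are expected to have a counterpart.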

Upvotes: 1
