KRDH

Reputation: 1

Fuzzy matching by category

I am trying to fuzzy match two different dataframes based on company names, using the agrep function. To improve my matching, I would like to only match companies if they are located in the same country.

 df1:                             df2:
 Company               ISO        Company                ISO
 Aalberts Industries   NL         Aalberts               NL
 Allison               NL         Allison transmission   NL
 Allison               UK         Allison transmission   UK

I use the following function to match:

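testb$test2 <- ""
for (i in seq_len(nrow(testb))) {
  # Approximate matches of the i-th name in testb against all names in testa
  x2 <- agrep(testb$name[i], testa$name, ignore.case = TRUE, value = TRUE,
              max.distance = Inf, useBytes = TRUE, fixed = TRUE)
  # Collapse to a single string ("" when nothing matches) so the assignment has length 1
  testb$test2[i] <- paste(x2, collapse = "; ")
}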

I can create a subset for every country and then match each subset, which works but is time-consuming. Is there another way to let R match company names only if df1$ISO == df2$ISO? Thanks!

Upvotes: 0

Views: 282

Answers (1)

coletl

Reputation: 803

Try indexing with the data.table package: https://www.r-bloggers.com/intro-to-the-data-table-package/.
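
As a rough sketch of the country restriction itself, here is a version in base R rather than data.table, so it has no dependencies. The data frames are copied from the example tables in the question; the `match` column name and the `max.distance = 0.1` setting are my own assumptions, not something from the original code.

```r
# Example data copied from the question's tables
df1 <- data.frame(
  Company = c("Aalberts Industries", "Allison", "Allison"),
  ISO     = c("NL", "NL", "UK"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Company = c("Aalberts", "Allison transmission", "Allison transmission"),
  ISO     = c("NL", "NL", "UK"),
  stringsAsFactors = FALSE
)

# For each row of df1, search only the df2 rows that share its ISO code
df1$match <- vapply(seq_len(nrow(df1)), function(i) {
  candidates <- df2$Company[df2$ISO == df1$ISO[i]]
  # max.distance = 0.1 is an assumed tolerance; tune it for your data
  hits <- agrep(df1$Company[i], candidates, ignore.case = TRUE,
                value = TRUE, max.distance = 0.1)
  # Collapse several hits into one string; no hits gives ""
  paste(hits, collapse = "; ")
}, character(1))

print(df1)
```

Because the candidates are filtered by ISO before agrep() is called, there is no need to build and match a separate subset per country by hand.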

Your company columns seem to be too dissimilar to match consistently and accurately with agrep(). For example, "Aalberts Industries" will match "Aalberts" only when you set max.distance to a value greater than 10. The same string distance would also report a match between "Algebra" and "Alleyway", which are not close at all. I recommend cleaning out the unnecessary words in your company columns before matching.
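
A minimal sketch of that cleaning step, assuming you can list the filler words yourself. The `filler` vector and the `clean_name()` helper are hypothetical names made up for illustration, and the word list is only a starting point.

```r
# Hypothetical list of filler words; extend it for your own data
filler <- c("industries", "transmission", "holding", "group")

clean_name <- function(x) {
  x <- tolower(x)
  # Remove each filler word where it appears as a whole word
  pattern <- paste0("\\b(", paste(filler, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse repeated whitespace and trim both ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

cleaned <- clean_name(c("Aalberts Industries", "Allison transmission"))
print(cleaned)

# After cleaning, the names agree even with agrep()'s default tolerance
agrep(clean_name("Aalberts Industries"), clean_name("Aalberts"))
```

Once the names are reduced to their distinctive part, a small max.distance is enough, which avoids the spurious matches that a large distance lets through.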

Sorry, I would make this a comment, but I don't have the required reputation. Maybe someone could convert this to a comment for me?

Upvotes: 1
