Reputation: 1
I am trying to fuzzy match two different dataframes based on company names, using the agrep function. To improve my matching, I would like to only match companies if they are located in the same country.
df1:
Company              ISO
Aalberts Industries  NL
Allison              NL
Allison              UK

df2:
Company               ISO
Aalberts              NL
Allison transmission  NL
Allison transmission  UK
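I use the following loop to match:
testb$test2 <- ""
for (i in seq_len(nrow(testb))) {
  # Approximate matches of the i-th name in testb against all names in testa
  x2 <- agrep(testb$name[i], testa$name, ignore.case = TRUE, value = TRUE,
              max.distance = Inf, useBytes = TRUE, fixed = TRUE)
  # Collapse to one string so that zero or several matches still fit in one cell
  testb$test2[i] <- paste(x2, collapse = "; ")
}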
I can create a subset for every country and then match each subset, which works but is time-consuming. Is there another way to let R only match company names if df1$ISO == df2$ISO? Thanks!
Upvotes: 0
Views: 282
Reputation: 803
Try indexing with the data.table package: https://www.r-bloggers.com/intro-to-the-data-table-package/.
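The idea of restricting candidates to the same country can also be sketched in base R, without any extra package. This is only a minimal sketch, not a data.table solution: the data frames copy the example from the question, while the helper name match_within_country and the max.distance value of 0.1 are assumptions chosen for illustration.

```r
# Example data taken from the question
df1 <- data.frame(
  Company = c("Aalberts Industries", "Allison", "Allison"),
  ISO     = c("NL", "NL", "UK"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Company = c("Aalberts", "Allison transmission", "Allison transmission"),
  ISO     = c("NL", "NL", "UK"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row of `left`, fuzzy match only against
# the rows of `right` that share the same ISO code.
match_within_country <- function(left, right, max.distance = 0.1) {
  vapply(seq_len(nrow(left)), function(i) {
    # Candidate names limited to the same country
    candidates <- right$Company[right$ISO == left$ISO[i]]
    hits <- agrep(left$Company[i], candidates,
                  ignore.case = TRUE, value = TRUE,
                  max.distance = max.distance)
    # Zero or several hits are collapsed into a single string
    paste(hits, collapse = "; ")
  }, character(1))
}

df1$match <- match_within_country(df1, df2)
print(df1)
```

With this threshold both "Allison" rows should pick up "Allison transmission" from their own country only, while "Aalberts Industries" should stay unmatched, because the longer name is used as the pattern.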
Your company columns seem to be too dissimilar to match consistently and accurately with agrep(). For example, "Aalberts Industries" will match "Aalberts" only when you set max.distance to a value greater than 10. The same string distance would also report a match between "Algebra" and "Alleyway", which are not very close at all. I recommend cleaning out the unnecessary words in your company columns before matching.
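One rough way to do that cleaning is to lower-case the names and strip a list of generic words before calling agrep(). The sketch below assumes a made-up word list (noise_words) and a hypothetical helper name (clean_company); the right words to drop depend on your own data.

```r
# Hypothetical list of words to drop; extend it for your own data
noise_words <- c("industries", "transmission", "inc", "ltd", "bv")

clean_company <- function(x, drop = noise_words) {
  x <- tolower(x)
  # Remove whole words only, using word boundaries
  pattern <- paste0("\\b(", paste(drop, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse repeated whitespace and trim both ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

cleaned <- clean_company(c("Aalberts Industries", "Allison transmission"))
print(cleaned)
```

After cleaning, both sides reduce to short core names, so a much smaller max.distance should be enough to match them.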
Sorry, I would make this a comment, but I don't have the required reputation. Maybe someone could convert this to a comment for me?
Upvotes: 1