Reputation: 167
I have two lists, a list of Customers, and a second list of Customers with their Customer number. I'd like to match on the Customer names and return their Customer number.
What is the best way to match the Customer strings between the two lists?
Note that the names may not match exactly, e.g. 'Company Name Inc' / 'Company Incorporated' / 'Company-Name Inc' / 'COMPANYNAME Inc' ...
Are there any commands which will provide the best match?
Thanks Gavin
Upvotes: 1
Views: 830
Reputation: 1076
This is a pretty deep and complicated subject. In addition to Matt Sandgren's answer, you may also want to look at the adist function, which is built into R (utils package) and computes the generalized Levenshtein distance. If you're new to string matching stuff, you may want to try a few more things:
If you just want to rank some matches, that's one thing, but if false negatives/positives are an issue then that's a whole other issue! Depends on the problem...
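To make the adist suggestion concrete, here is a minimal base-R sketch that ranks candidates by edit distance. The example names, the lookup data frame and the choice of ignore.case = TRUE are my own assumptions for illustration, not something from the question. It always returns the nearest candidate, so the distance column still has to be checked before a match is trusted.

```r
# Hypothetical example data (made up for illustration)
customers <- c("Apple Ltd", "PearLtd", "Banana Co Ltd")
lookup <- data.frame(
  name = c("Pear Limited", "Banana ltd", "Appl Ltd"),
  cust_num = c(10001, 10002, 10003),
  stringsAsFactors = FALSE
)

# adist() returns a matrix of generalized Levenshtein distances:
# one row per customer, one column per lookup name.
# ignore.case = TRUE makes "Ltd" and "ltd" compare as equal.
d <- adist(customers, lookup$name, ignore.case = TRUE)

# For each customer, the column index of the closest lookup name
closest <- apply(d, 1, which.min)

# Ranked result: best candidate, its customer number, and how far away it is
ranked <- data.frame(
  customer = customers,
  best_match = lookup$name[closest],
  cust_num = lookup$cust_num[closest],
  distance = d[cbind(seq_along(customers), closest)]
)
print(ranked)
```

A large value in the distance column is a hint that the "best" match may be a false positive and needs a manual look.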
Upvotes: 0
Reputation: 476
For matching strings that aren't exactly the same, I'll point you to the stringdist package. In particular, the amatch function from that package should be helpful. Here's a link to the documentation:
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
Some code that would reproduce a small version of your dataframe would be useful, but I've created this short bit of code. It first creates two character vectors based on names like the ones you listed, then builds two dataframes from them.
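# install.packages("stringdist")  # run once if the package is not installed
library(stringdist)
# Names without customer numbers
names_1 <- c("Apple Ltd", "PearLtd", "Banana Co Ltd")
# Names that come with customer numbers
names_2 <- c("Pear Limited", "Banana ltd", "Appl Ltd")
cust_num <- c(10001, 10002, 10003)
df_1 <- data.frame(names_1)
df_2 <- data.frame(names_2, cust_num)
# For each name in df_2, position of its closest match in df_1 (NA if none within maxDist)
match_idx <- amatch(df_2$names_2, df_1$names_1, maxDist = 4)
# Customer numbers of the df_2 names that found a match in df_1
df_2$cust_num[!is.na(match_idx)]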
The last line simply outputs a vector of the customer IDs for company names that were found in both lists.
The documentation will explain the parameters for amatch, but you'll run into problems with maxDist: set it too low, and your company names won't match; set it too high, and you'll get false positives. You can see the first problem in this example, where only two IDs are returned, because "Pear Limited" is too far from "PearLtd".
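To see the maxDist trade-off in action, here is a rough sketch that runs the lookup in the direction the question asks for (names without numbers matched against names with numbers) at three thresholds. The match_at helper and the thresholds 1, 4 and 10 are hypothetical choices of mine, and it assumes the stringdist package is installed.

```r
library(stringdist)

names_1 <- c("Apple Ltd", "PearLtd", "Banana Co Ltd")      # no customer numbers
names_2 <- c("Pear Limited", "Banana ltd", "Appl Ltd")     # with customer numbers
cust_num <- c(10001, 10002, 10003)

# Hypothetical helper: look up each name in names_1 among names_2
# and attach the customer number of the match (NA if nothing is within max_dist)
match_at <- function(max_dist) {
  idx <- amatch(names_1, names_2, method = "osa", maxDist = max_dist)
  data.frame(name = names_1, matched = names_2[idx], cust_num = cust_num[idx])
}

strict <- match_at(1)   # only near-identical names match
loose  <- match_at(4)   # same threshold as the example above
wide   <- match_at(10)  # almost everything matches something

print(strict)
print(loose)
print(wide)
```

Comparing the three outputs side by side on a sample of your real data is one way to pick a threshold: rows that are NA under the strict setting but matched under the wide one are the ones worth checking by hand.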
Upvotes: 1