Reputation: 3488
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
4
4
I see algorithms for getting the distance between two words, so if I had just one odd name I could compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to cluster all of them into groups.
I do not know anything about Elasticsearch, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Upvotes: 1
Views: 1832
Reputation: 1074
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames="strings", method="lcs") # label rows/columns with the strings
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence (LCS) metric (try others, e.g. cosine or Levenshtein). In essence they all measure how dissimilar two strings are at the character level: the more characters two strings share, the smaller the distance. Their pros and cons are beyond the scope of this Q&A. You might look into a metric that gives a higher similarity value to two strings that contain the exact same substring (like dow).
sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
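One way to reward names that share an exact word (the "dow" case mentioned above) is to compare whole tokens instead of characters. The sketch below is only an illustration in base R: token_overlap_dist is a hypothetical helper, not part of stringdist, and it uses an overlap coefficient on whitespace-separated words, which is my assumption about what "same substring" should mean here.

```r
# Hypothetical helper (not from any package): distance based on shared words.
# 0 means the shorter name's words all appear in the longer name; 1 means no shared word.
token_overlap_dist <- function(x) {
  tokens <- lapply(strsplit(tolower(x), "\\s+"), unique)
  n <- length(x)
  d <- matrix(0, n, n, dimnames = list(x, x))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      shared  <- length(intersect(tokens[[i]], tokens[[j]]))
      smaller <- min(length(tokens[[i]]), length(tokens[[j]]))
      d[i, j] <- 1 - shared / smaller   # overlap coefficient turned into a distance
    }
  }
  d
}

CompanyName <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa',
                 'gm', 'general motors', 'the dow chemical company', 'dow')
tod <- token_overlap_dist(CompanyName)
tod["dow", "the dow chemical company"]  # shares the word "dow"
tod["kraft", "kfraft"]                  # a misspelling shares no whole word
```

A word-level distance like this gives no credit for misspellings such as kfraft, so in practice it would probably need to be combined with a character-level distance rather than replace it.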
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
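The dendrogram can also be cut to get the integer group labels asked for in the question. This is a minimal sketch, not a tuned solution: base R's adist (Levenshtein distance) stands in for stringdistmatrix so that it runs without extra packages, and k = 4 and h = 3 are assumed values chosen only for this toy example. The sdm matrix computed above can be dropped in instead.

```r
CompanyName <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                         'GM', 'general motors', 'the dow chemical company', 'Dow'))

# Levenshtein distances from base R (utils::adist), converted to a dist object
d  <- as.dist(adist(CompanyName))
hc <- hclust(d)

# Cut the tree into a fixed number of groups (k = 4 is an assumption)
groups <- cutree(hc, k = 4)
result <- data.frame(CompanyName, group = groups)
print(result)

# Alternatively, cut at a distance threshold when the number of groups is unknown
# (h = 3 is an arbitrary illustrative height)
groups_by_height <- cutree(hc, h = 3)
```

With thousands of names the number of groups is usually not known in advance, so cutting by height is likely the more practical option. How well the groups match the intended companies depends heavily on the metric, so abbreviations such as GM may well end up in the wrong group.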
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=FALSE, labels=2, lines=0)
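Since each k-medoids cluster is centred on an actual observation, the medoid can serve as a representative name for its group, which gets close to the CompanyName2 column in the question. A sketch under assumptions: base R's adist again stands in for the stringdist matrix, and k = 4 is assumed for this toy data.

```r
library("cluster")

CompanyName <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                         'GM', 'general motors', 'the dow chemical company', 'Dow'))
d <- as.dist(adist(CompanyName))   # Levenshtein distances (base R stand-in)

fit <- pam(d, k = 4)               # k = 4 is an assumption

# fit$clustering holds the group number of each name;
# fit$id.med holds the index of each group's medoid
CompanyName2 <- CompanyName[fit$id.med][fit$clustering]
print(data.frame(CompanyName, group = fit$clustering, CompanyName2))
```

The medoid is simply the most central name in its cluster, so it will not necessarily be the cleanest spelling; picking a preferred label per group may still need a manual pass.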
Upvotes: 4