Andrew Taylor

Reputation: 3488

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')

I want to get either:

CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow

But would be absolutely fine with:

CompanyName2
1
1
1
2
2
3
3
4
4

I see algorithms for getting the distance between two words, so if I had just one weird name I could compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to cluster all of them into groups.

I do not know anything about Elasticsearch, but would one of the functions in the elastic package, or some other function, help me out here?

I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.

Upvotes: 1

Views: 1832

Answers (1)

user3554004

Reputation: 1074

Solution: use string distance

You're on the right track. Here is some R code to get you started:

install.packages("stringdist") # one-time install
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName <- tolower(CompanyName) # lower-case first, otherwise case differences inflate the distances
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # opens the help page listing the other metrics
# useNames = "strings" labels rows and columns with the strings themselves
sdm <- stringdistmatrix(CompanyName, CompanyName, useNames = "strings", method = "lcs")

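Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence (LCS) metric (try others too, e.g. cosine or Levenshtein). They all measure, in essence, how much two strings differ. With LCS, the distance is the number of characters left unmatched in both strings once their longest common subsequence is removed, which is why kraft vs. kraft foods scores 6. The pros and cons of each metric are beyond the scope of this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow).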

sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
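
The substring idea mentioned above can be sketched in base R. This is only an illustration under stated assumptions: it swaps in adist() (Levenshtein distance, from base R) for the stringdist LCS metric so that it runs without extra packages, and the rule "set the distance to 0 when one name contains the other" is a crude, made-up adjustment, not something the stringdist package provides.

```r
# Sketch: reward exact substring containment (assumed rule, not part of stringdist)
company <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                     'GM', 'general motors', 'the dow chemical company', 'Dow'))

# Base-R Levenshtein distances, used here in place of stringdist's "lcs"
d <- adist(company)
dimnames(d) <- list(company, company)

# contains[i, j] is TRUE when company[j] occurs verbatim inside company[i]
contains <- sapply(company, function(p) grepl(p, company, fixed = TRUE))

# Treat a pair as identical if either name contains the other
d_adj <- d
d_adj[contains | t(contains)] <- 0

# Inspect a few adjusted distances
d_adj[c("dow", "kraft", "kfraft"), c("the dow chemical company", "kraft foods")]
```

Be careful with very short names: under this rule something like gm would match any longer name that happens to contain those two letters, so a minimum length for the contained string may be worth adding.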

Some visualization

# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
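
To get the group numbers the question asks for, the tree can be cut at a chosen height with cutree(). A minimal sketch, assuming base R's adist() (Levenshtein) as a stand-in for the stringdist matrix so that it runs without extra packages; the cut-off h = 4 is an arbitrary value picked for this toy data, not a recommended setting.

```r
# Sketch: turn the dendrogram into group labels (h = 4 is an assumed cut-off)
company <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                     'GM', 'general motors', 'the dow chemical company', 'Dow'))

# Base-R Levenshtein distances, standing in for the stringdist matrix
d <- adist(company)
dimnames(d) <- list(company, company)

hc <- hclust(as.dist(d), method = "complete")

# Names whose clusters merge at height <= h end up in the same group
groups <- cutree(hc, h = 4)

data.frame(CompanyName = company, CompanyName2 = groups, row.names = NULL)
```

Raising h merges more names into each group. Abbreviations such as gm are unlikely to land with general motors under a character-level distance, so those probably need a manual mapping.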

If you want to group them explicitly into k groups, use k-medoids.

library("cluster")
# k = 5 clusters here; adjust k to your data
clusplot(pam(sdm_dist, 5), color=TRUE, shade=FALSE, labels=2, lines=0)
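
clusplot() only draws the result. To pull out the labels, plus a representative name for each group, the fitted pam object can be inspected directly. A minimal sketch, again assuming base R's adist() (Levenshtein) as a stand-in for the stringdist matrix, with k = 4 chosen to match the four companies in the question:

```r
library(cluster)  # recommended package, shipped with standard R installations

company <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                     'GM', 'general motors', 'the dow chemical company', 'Dow'))

# Base-R Levenshtein distances, standing in for the stringdist matrix
d <- adist(company)
dimnames(d) <- list(company, company)

# k-medoids on the precomputed distances (k = 4 is an assumption)
fit <- pam(as.dist(d), k = 4)

result <- data.frame(
  CompanyName  = company,
  CompanyName2 = fit$clustering,                       # group number, 1..k
  Medoid       = company[fit$id.med[fit$clustering]],  # representative name of the group
  row.names    = NULL
)
result
```

The Medoid column gives the named version of the output from the question, and CompanyName2 gives the numbered version. Which names end up together depends heavily on the metric and on k, so treat the grouping as a first pass to review by hand.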


Upvotes: 4
