Reputation: 2556
I am trying to group a list of words by similarities. I find this interesting issue related to this topic and try to implement the Affinity Propagation algorithm in my words array however I find the output very poor.
How can I improve this algorithm ?
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools
words = np.array(['ice cubes',
'club soda',
'white rum',
'lime',
'turbinado',
'egg',
'hearts of palm',
'cilantro',
'coconut cream',
'flax seed meal',
'kosher salt',
'jalapeno chilies',
'garlic',
'cream cheese soften',
'coconut oil',
'lime juice',
'crushed red pepper flakes',
'ground coriander',
'pepper',
'chicken breasts',
'coconut flour',
'onion',
'sweetened condensed milk',
'butter',
'cocoa powder',
'lime',
'crushed ice',
'simple syrup',
'cachaca',
'sugar',
'corn starch',
'egg whites',
'boiling water',
'cold water',
'egg yolks',
'sweetened condensed milk',
'milk',
'jell-o gelatin dessert',
'olive oil',
'low sodium chicken broth',
'cilantro leaves',
'chile powder',
'fresh thyme',
'chile pepper',
'sweet paprika',
'sablefish',
'brown rice',
'yellow onion',
'low-fat coconut milk',
'roma tomatoes',
'garlic',
'fresh lime juice',
'egg',
'grating cheese',
'milk',
'tapioca flour',
'salt',
'olive oil',
'coconut milk',
'frozen banana',
'pure acai puree',
'almond butter',
'kosher salt',
'dijon mustard',
'sweet paprika',
'boneless skinless chicken breast halves',
'caraway seeds',
'ground black pepper',
'lime wedges',
'chopped cilantro',
'lager beer',
'peeled fresh ginger',
'garlic cloves',
'green bell pepper',
'unsalted butter',
'vegetable oil',
'onion',
'egg',
'whole milk',
'extra-virgin olive oil',
'garlic cloves',
'corn kernels',
'chicken breasts',
'all-purpose flour',
'cream cheese soften',
'celery ribs'])
print("calculating distances...")
(dim,) = words.shape
f = lambda x_y: -leven.distance(x_y[0],x_y[1])
res=np.fromiter(map(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))
af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
# Distances had to be converted to similarities, I did that by taking the negative of distance. The output is
unique_labels = set(labels)
for i in unique_labels:
print(words[labels==i])
Here is the output I get :
calculating distances...
['lime' 'lime']
['egg' 'egg' 'egg']
['cream cheese soften' 'cream cheese soften']
['sweetened condensed milk' 'sweetened condensed milk']
['milk' 'milk']
['olive oil']
['turbinado' 'hearts of palm' 'pepper' 'crushed ice' 'sugar' 'egg whites'
'boiling water' 'egg yolks' 'jell-o gelatin dessert' 'cilantro leaves'
'chile powder' 'fresh thyme' 'chile pepper' 'sablefish' 'brown rice'
'low-fat coconut milk' 'fresh lime juice' 'grating cheese' 'tapioca flour'
'dijon mustard' 'caraway seeds' 'lime wedges' 'lager beer'
'peeled fresh ginger' 'green bell pepper' 'vegetable oil'
'all-purpose flour']
['garlic' 'garlic']
['olive oil']
['kosher salt' 'kosher salt']
['sweet paprika' 'sweet paprika']
['garlic cloves' 'garlic cloves']
['ice cubes' 'club soda' 'white rum' 'cilantro' 'coconut cream'
'flax seed meal' 'jalapeno chilies' 'coconut oil' 'lime juice'
'crushed red pepper flakes' 'ground coriander' 'coconut flour' 'onion'
'butter' 'cocoa powder' 'simple syrup' 'cachaca' 'corn starch'
'cold water' 'low sodium chicken broth' 'yellow onion' 'roma tomatoes'
'salt' 'coconut milk' 'frozen banana' 'pure acai puree' 'almond butter'
'boneless skinless chicken breast halves' 'ground black pepper'
'chopped cilantro' 'unsalted butter' 'onion' 'whole milk'
'extra-virgin olive oil' 'corn kernels' 'celery ribs']
['chicken breasts' 'chicken breasts']
As you can see the grouping is not so great. I would expect all words with 'salt' for example being grouped together. Similarly, i would expect 'fresh lime juice' be grouped with 'lime' or 'lime wedges'
Thanks
Upvotes: 1
Views: 2775
Reputation: 77454
You use Levenshtein distance on characters.
To turn "salt example" into "salt somethingelse", it needs to delete character-by-character most of the string, then re-add the remainder.
In other words, your distance function does not satsify your desires.
You should try to first figure out the appropriate distance measure. this could be e.g. longest common substring based (salt A
and salt B
have a substring of length 5 in common), or yould also treat all words independently, use normalized levenshtein averaged over all words, etc.
Definitely spend more time on the distance function.
Upvotes: 0
Reputation: 4128
You are grouping sentences instead of words. Depending on how you want to bias this, you could split each sentence into words and calculate the score between sentences to be:
Or for each word in the shorter sentence, find the minimum Levenshtein distance from words of the longer sentence, the score is then:
There are many other possibilities as well.
Upvotes: 1
Reputation: 2034
Your words are not words I think, rather sentences. You could maybe check parts of these first to pre-group.
Upvotes: 1