Reputation: 5406
I'm trying to cluster some words (let's take car brands as an example). I can't use k-means or k-medoids for this, so I've tried Affinity Propagation from Sklearn, using levenshtein from the distance lib or damerau_levenshtein_distance from the pyxdameraulevenshtein lib as the metric.
Example here: https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups
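Roughly, my current setup looks like this (a shortened sketch; the brand list and the use of the distance lib are just for illustration):

    import numpy as np
    import distance
    from sklearn.cluster import AffinityPropagation

    brands = ["MERCEDES-BENZ", "MERCEDES", "VOLVO", "FIAT", "PEUGEOT"]

    # Affinity Propagation expects similarities, so negate the Levenshtein distances
    similarity = -1 * np.array(
        [[distance.levenshtein(a, b) for b in brands] for a in brands]
    )

    ap = AffinityPropagation(affinity="precomputed")
    ap.fit(similarity)
    print(dict(zip(brands, ap.labels_)))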
However, these metrics are not exactly what I need. For example, MERCEDES-BENZ and MERCEDES have a distance of 5, the same as VOLVO and FIAT. Do you know of a metric that would give a higher similarity score to MERCEDES-BENZ / MERCEDES than to VOLVO / FIAT?
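For instance, both pairs come out at 5 with the distance lib:

    import distance

    print(distance.levenshtein("MERCEDES-BENZ", "MERCEDES"))  # 5 (delete "-BENZ")
    print(distance.levenshtein("VOLVO", "FIAT"))              # 5 as well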
Thanks, Djokx
Upvotes: 0
Views: 374
Reputation: 573
You could use the Jaccard similarity of the tri-grams composing those words. That is, you decompose each word into its overlapping three-character components (for volvo: vol, olv, lvo) and compute the Jaccard similarity between each pair of trigram sets. See N-gram.
The Jaccard similarity is defined as the ratio between the number of n-grams the two words share and the total number of distinct n-grams across both words (intersection over union): Jaccard index.
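A minimal sketch in plain Python (the trigrams and jaccard helpers are just illustrative names):

    def trigrams(word):
        """Set of overlapping three-character substrings of the word."""
        word = word.lower()
        return {word[i:i + 3] for i in range(len(word) - 2)}

    def jaccard(a, b):
        """Jaccard similarity: shared trigrams / all distinct trigrams."""
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / len(ta | tb)

    print(jaccard("MERCEDES-BENZ", "MERCEDES"))  # 6/11 ~ 0.55
    print(jaccard("VOLVO", "FIAT"))              # 0.0, no trigram in common

Since Affinity Propagation takes similarities, you could feed this Jaccard matrix directly as the precomputed affinity instead of the negated Levenshtein distances.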
Upvotes: 1