RobinFrcd

Reputation: 5406

Words clustering

I'm trying to cluster some words (car brands, for example). I can't use k-means or k-medoids for this, so I tried Affinity Propagation from scikit-learn, using either levenshtein from the distance library or damerau_levenshtein_distance from the pyxdameraulevenshtein library as the metric.

Example here: https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups

However, these metrics are not exactly what I need. For example, MERCEDES-BENZ and MERCEDES have a distance of 5, the same as VOLVO and FIAT. Does anyone know of a metric that would give a higher similarity score to MERCEDES-BENZ and MERCEDES than to VOLVO and FIAT?

Thanks, Djokx

Upvotes: 0

Views: 374

Answers (1)

shirowww

Reputation: 573

You could use the Jaccard similarity of the trigrams that make up those words. That is, decompose each word into its three-character components (for volvo: vol, olv, lvo) and compute the Jaccard similarity between the resulting sets. These components are character n-grams with n = 3.

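Jaccard similarity (also called the Jaccard index) is the number of n-grams the two words have in common divided by the number of distinct n-grams that appear in either word, i.e. the size of the intersection of the two sets over the size of their union.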
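
As a rough illustration, here is a minimal sketch of that idea in plain Python. The function names char_ngrams and jaccard_similarity are made up for this example, and lowercasing the input and returning 0.0 when both words are too short to produce any n-gram are assumptions, not part of the definition.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (trigrams by default)."""
    word = word.lower()  # assumption: comparison should be case-insensitive
    # Words shorter than n give an empty range, hence an empty set.
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard_similarity(a, b, n=3):
    """Jaccard index of the n-gram sets of two words: |A & B| / |A | B|."""
    grams_a = char_ngrams(a, n)
    grams_b = char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # assumption: treat two words with no n-grams as dissimilar
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    print(jaccard_similarity("MERCEDES-BENZ", "MERCEDES"))
    print(jaccard_similarity("VOLVO", "FIAT"))
```

With trigrams, MERCEDES-BENZ and MERCEDES share 6 of 11 distinct trigrams, so they score 6/11 (about 0.55), while VOLVO and FIAT share none and score 0. Since this is already a similarity rather than a distance, a matrix of these scores should be usable with scikit-learn's AffinityPropagation by setting affinity="precomputed".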

Upvotes: 1
