Kushagra Bhatia
Kushagra Bhatia

Reputation: 113

Clustering words with Kmeans

How do I cluster terms (1-2 words) using Kmeans. I read a research paper where they had used K Means to cluster similar terms using Levenshtein Distance. Please help me by showing a sample code.

Thank you

Note: In the research paper they had computed the similarity matrix using Levenshtein Distance and used it for clustering.

https://ieeexplore.ieee.org/document/7765062/

Upvotes: 0

Views: 2449

Answers (3)

A matrix is initialized measuring in the (m,n)-cell the Levenshtein distance between the m-character prefix of one with the n-prefix of the other word. The matrix can be filled from the upper left to the lower right corner.

Upvotes: 0

Mehdi
Mehdi

Reputation: 4318

from nltk.metrics import distance
import scipy.spatial as spatial
import numpy as np
from scipy.cluster.vq import kmeans

# sample vocabulary list
words = ['test', 'text', 'best', 'fast', 'context', 'boost', 'faster', 'border']

# similarity matrix
word_vectors = np.array([
    [
        distance.edit_distance(w, _w)
        for _w in words
    ]
    for w in words
], dtype=np.float)

centroids, _ = kmeans(word_vectors, k_or_guess=3)

word_clusters = np.argmin([
    [spatial.distance.euclidean(wv, cv) for cv in centroids]
    for wv in word_vectors
], 1)

for k in range(centroids.shape[0]):
    print('k =', k)
    print([word for i, word in enumerate(words) if word_clusters[i] == k])

It results as:

k = 0
['faster', 'border']
k = 1
['test', 'text', 'best', 'fast', 'boost']
k = 2
['context']

Remarks:

  1. Original vocabulary works as a feature list. The list of distance measures to other words works as a feature vector to any phrase or word.
  2. Each cluster is made in such feature space. Consequently, the distance between two words is not their Levenshtein Distance anymore but it is their distance in such space. This is why we use other measures such as spatial.distance.euclidean.
  3. Kmean produces centroids in this feature space, each word is considered as a member to a cluster if the cluster centroid is the closest to the word (out of all other centroids). np.argmin([...], 1) is finding such assignment for each word.
  4. Other clustering algorithms can be also tested on word-feature space. (List of some clustering algorithms in scikit-learn: https://scikit-learn.org/stable/modules/clustering.html)

Upvotes: 1

def get_levenshtein_distance(word1, word2):
word2 = word2.lower()
word1 = word1.lower()
matrix = [[0 for x in range(len(word2) + 1)] for x in range(len(word1) + 1)]

for x in range(len(word1) + 1):
    matrix[x][0] = x
for y in range(len(word2) + 1):
    matrix[0][y] = y

print(matrix)

Upvotes: 0

Related Questions