mantunes
mantunes

Reputation: 25

Text data clustering with python

I am currently trying to cluster a list of sequences based on their similarity using python.

ex:

DFKLKSLFD

DLFKFKDLD

LDPELDKSL
...

The way I pre process my data is by computing the pairwise distances using for example the Levenshtein distance. After calculating all the pairwise distances and creating the distance matrix, I want to use it as input for the clustering algorithm.

I have already tried using Affinity Propagation, but convergence is a bit unpredictable and I would like to go around this problem.

Does anyone have any suggestions regarding other suitable clustering algorithms for this case?

Thank you!!

Upvotes: 1

Views: 2208

Answers (3)

common_words = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

Upvotes: 0

Willem Hendriks
Willem Hendriks

Reputation: 1487

sklearn actually does show this example using DBSCAN, just like Luke once answered here.

This is based on that example, using !pip install python-Levenshtein. But if you have pre-calculated all distances, you could change the custom metric, as shown below.

from Levenshtein import distance

import numpy as np
from sklearn.cluster import dbscan

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def z:
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

dbscan(X, metric=lev_metric, eps=5, min_samples=2)

And if you pre-calculated you could define pre_lev_metric(x, y) along the lines of

def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return DISTANCES[i,j]

Alternative answer based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but only needs distance as well.

I had to install like this

!pip uninstall -y enum34
!pip install scikit-learn-extra

Than I was able to create clusters with;

from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)

The labels/centers are in

kmedoids.labels_
kmedoids.cluster_centers_

Upvotes: 1

ASH
ASH

Reputation: 20302

Try this.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
    
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Results:

 - *LDPELDKSL:* LDPELDKSL
 - *DFKLKSLFD:* DFKLKSLFD
 - *XYZ:* ABC, XYZ
 - *DLFKFKDLD:* DLFKFKDLD

Upvotes: 0

Related Questions