Reputation: 25
I am currently trying to cluster a list of sequences based on their similarity using python.
ex:
DFKLKSLFD
DLFKFKDLD
LDPELDKSL
...
The way I pre process my data is by computing the pairwise distances using for example the Levenshtein distance. After calculating all the pairwise distances and creating the distance matrix, I want to use it as input for the clustering algorithm.
I have already tried using Affinity Propagation, but convergence is a bit unpredictable and I would like to go around this problem.
Does anyone have any suggestions regarding other suitable clustering algorithms for this case?
Thank you!!
Upvotes: 1
Views: 2208
Reputation: 1
common_words = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]
for num, centroid in enumerate(common_words):
print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))
Upvotes: 0
Reputation: 1487
sklearn actually does show this example using DBSCAN, just like Luke once answered here.
This is based on that example, using !pip install python-Levenshtein
.
But if you have pre-calculated all distances, you could change the custom metric, as shown below.
from Levenshtein import distance
import numpy as np
from sklearn.cluster import dbscan
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def z:
i, j = int(x[0]), int(y[0]) # extract indices
return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
And if you pre-calculated you could define pre_lev_metric(x, y)
along the lines of
def pre_lev_metric(x, y):
i, j = int(x[0]), int(y[0]) # extract indices
return DISTANCES[i,j]
Alternative answer based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but only needs distance as well.
I had to install like this
!pip uninstall -y enum34
!pip install scikit-learn-extra
Than I was able to create clusters with;
from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
i, j = int(x[0]), int(y[0]) # extract indices
return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)
The labels/centers are in
kmedoids.labels_
kmedoids.cluster_centers_
Upvotes: 1
Reputation: 20302
Try this.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
cluster_str = ", ".join(cluster)
print(" - *%s:* %s" % (exemplar, cluster_str))
Results:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
Upvotes: 0