Reputation: 15
I need to run the K-means clustering algorithm on textual data, but using cosine distance instead of Euclidean distance. Is there a reliable implementation of this in Python?
Edit:
I have tried to use NLTK as follows:
import nltk
from nltk.cluster import KMeansClusterer

NUM_CLUSTERS = 3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
clstr = kclusterer.cluster(X, assign_clusters=False, trace=False)
print(clstr)
But it gives me this error:
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
X here is a TF-IDF matrix of shape (15, 155).
Upvotes: 0
Views: 1498
Reputation: 1008
If you want to implement it yourself, see https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
and just change the distance measure. The distance is computed in the for loop over i
in the pseudocode.
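As a rough illustration of that idea, here is a minimal sketch of a K-means loop with the distance step swapped for cosine distance. It is not the code from the linked handout; the function name `cosine_kmeans`, its parameters, and the toy data are my own assumptions, and it expects a dense array with no all-zero rows.

```python
import numpy as np


def cosine_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style K-means on the rows of X using cosine distance (sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Scale every row to unit length so a dot product equals cosine similarity.
    # Assumption: no row of X is all zeros.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Start from k distinct rows chosen at random.
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: cosine distance = 1 - cosine similarity.
        dist = 1.0 - Xn @ centroids.T
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: mean of the members, rescaled to unit length.
        for j in range(k):
            members = Xn[labels == j]
            if len(members) > 0:
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centroids[j] = mean / norm
    return labels, centroids


# Toy data: two rows pointing roughly along each axis.
X_demo = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]])
labels, centroids = cosine_kmeans(X_demo, k=2)
print(labels)
```

If your data is a scipy sparse TF-IDF matrix, you would have to convert it first (for example with `X.toarray()`), which is only practical for small matrices like the (15, 155) one in the question.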
Upvotes: 1
Reputation: 2525
You can use NLTK for this. The K-means from NLTK allows you to specify which measure of distance you want to use.
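A minimal sketch of what that could look like is below. The small `X_dense` array is a made-up stand-in for real TF-IDF vectors, and the fixed `rng` is only there to make the run repeatable. As far as I can tell, the error in the question arises because NLTK calls `len()` on its input, which scipy sparse matrices do not support, so I am assuming the sparse matrix is converted to a dense array (for example with `X.toarray()`) before clustering.

```python
import random

import numpy as np
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance

# Hypothetical stand-in for TF-IDF vectors: one row per document.
# With a scipy sparse matrix X, use X_dense = X.toarray() instead.
X_dense = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.0, 0.2, 0.9],
])

NUM_CLUSTERS = 2
kclusterer = KMeansClusterer(
    NUM_CLUSTERS,
    distance=cosine_distance,   # use cosine distance instead of Euclidean
    repeats=25,                 # rerun with different random starting means
    rng=random.Random(0),       # fixed seed so the run is repeatable
    avoid_empty_clusters=True,
)
# assign_clusters=True returns one cluster index per input row.
assigned = kclusterer.cluster(X_dense, assign_clusters=True)
print(assigned)
```

Converting to dense is fine for a matrix of shape (15, 155), but it may use too much memory for a large vocabulary.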
Upvotes: 0