Reputation: 12515
import numpy as np
import sklearn.feature_extraction.text as text
from sklearn import decomposition

descs = ["You should not go there", "We may go home later", "Why should we do your chores", "What should we do"]

vectorizer = text.CountVectorizer()
dtm = vectorizer.fit_transform(descs).toarray()        # document-term count matrix
vocab = np.array(vectorizer.get_feature_names_out())   # get_feature_names() was removed in scikit-learn 1.2
nmf = decomposition.NMF(n_components=3, random_state=1)
topic = nmf.fit_transform(dtm)                         # document-topic weight matrix
Printing topic leaves me with:
>>> print(topic)
[[0.    1.403 0.   ]
 [0.    0.    1.637]
 [1.257 0.    0.   ]
 [0.874 0.056 0.065]]
These are vectors giving, for each element of descs, its likelihood of belonging to each cluster. How can I get the coordinates of the centroid of each cluster? Ultimately, I want to write a function that calculates the distance of each element in descs from the centroid of the cluster it was assigned to.
Would it be best to just compute, for each cluster, the average of the topic values of the descs elements assigned to it?
Upvotes: 1
Views: 1520
Reputation: 13743
The docs of sklearn.decomposition.NMF explain how to get the coordinates of the centroid of each cluster:
Attributes: components_ : array, [n_components, n_features]
Non-negative components of the data.
The basis vectors are arranged row-wise, as shown in the following interactive session:
In [995]: np.set_printoptions(precision=2)
In [996]: nmf.components_
Out[996]:
array([[ 0.54, 0.91, 0. , 0. , 0. , 0. , 0. , 0.89, 0. , 0.89, 0.37, 0.54, 0. , 0.54],
[ 0. , 0.01, 0.71, 0. , 0. , 0. , 0.71, 0.72, 0.71, 0.01, 0.02, 0. , 0.71, 0. ],
[ 0. , 0.01, 0.61, 0.61, 0.61, 0.61, 0. , 0. , 0. , 0.62, 0.02, 0. , 0. , 0. ]])
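For the distance function you describe, a minimal sketch could look like this. Assigning each document to its highest-weight topic and using Euclidean distance in term space are assumptions of mine, not something the docs prescribe:

import numpy as np

# Assign each document to the topic/cluster with the largest weight.
labels = topic.argmax(axis=1)

# Treat each row of components_ as the centroid of its cluster in term
# space and measure the Euclidean distance from every document-term
# vector to the centroid of its assigned cluster.
distances = np.linalg.norm(dtm - nmf.components_[labels], axis=1)

for doc, label, dist in zip(descs, labels, distances):
    print(f"{doc!r} -> cluster {label}, distance {dist:.3f}")

Note that NMF does not normalize the rows of components_, so depending on the scale of your counts you may prefer to compare each row of dtm against its reconstruction topic[i] @ nmf.components_ instead.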
As for your second question, I don't see the point of computing the average of each descs element's topic value for each cluster. In my opinion it makes more sense to perform the classification directly from the calculated likelihoods.
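For example, a one-line sketch of that classification (taking the argmax as my reading of "classification from the calculated likelihoods"):

# Each document goes to the cluster whose weight in topic is largest.
labels = topic.argmax(axis=1)
print(labels)  # [1 2 0 0] for the topic matrix printed above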
Upvotes: 2