boot-scootin

Reputation: 12515

Sklearn: find mean centroid location for clusters?

import numpy as np
import sklearn.feature_extraction.text as text
from sklearn import decomposition

descs = ["You should not go there", "We may go home later",
         "Why should we do your chores", "What should we do"]

# Build the document-term matrix
vectorizer = text.CountVectorizer()
dtm = vectorizer.fit_transform(descs).toarray()
vocab = np.array(vectorizer.get_feature_names())

# Factorize into 3 topics
nmf = decomposition.NMF(n_components=3, random_state=1)
topic = nmf.fit_transform(dtm)

Printing topic leaves me with:

>>> print(topic)
[[0.    1.403 0.   ]
 [0.    0.    1.637]
 [1.257 0.    0.   ]
 [0.874 0.056 0.065]]

Each row gives the weights indicating how strongly the corresponding element of descs belongs to each of the three clusters. How can I get the coordinates of the centroid of each cluster? Ultimately, I want to write a function that calculates the distance of each element in descs from the centroid of the cluster it was assigned to.

Would it be best to just compute the average of each descs element's topic value for each cluster?

Upvotes: 1

Views: 1520

Answers (1)

Tonechas

Reputation: 13743

The docs of sklearn.decomposition.NMF explain how to get the coordinates of the centroid of each cluster:

Attributes:     components_ : array, [n_components, n_features]
                            Non-negative components of the data.

The basis vectors are arranged row-wise, as shown in the following interactive session:

In [995]: np.set_printoptions(precision=2)

In [996]: nmf.components_
Out[996]: 
array([[ 0.54,  0.91,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.89,  0.  ,  0.89,  0.37,  0.54,  0.  ,  0.54],
       [ 0.  ,  0.01,  0.71,  0.  ,  0.  ,  0.  ,  0.71,  0.72,  0.71,  0.01,  0.02,  0.  ,  0.71,  0.  ],
       [ 0.  ,  0.01,  0.61,  0.61,  0.61,  0.61,  0.  ,  0.  ,  0.  ,  0.62,  0.02,  0.  ,  0.  ,  0.  ]])

As for your second question, I don't see the point of computing the average of each descs element's topic value for each cluster. In my opinion it makes more sense to assign each document to a cluster directly through the calculated likelihoods, i.e. the weights returned by fit_transform.
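For instance, here is a minimal sketch (not part of the original answer) of the distance function the question asks for: each document is assigned to its highest-weight topic, and its Euclidean distance in term space is measured to the corresponding row of `components_`, treated as the cluster centroid. The choice of Euclidean distance is an assumption; any other metric could be substituted.

```python
import numpy as np
import sklearn.feature_extraction.text as text
from sklearn import decomposition

descs = ["You should not go there", "We may go home later",
         "Why should we do your chores", "What should we do"]

# Document-term matrix and NMF factorization, as in the question
dtm = text.CountVectorizer().fit_transform(descs).toarray()
nmf = decomposition.NMF(n_components=3, random_state=1)
topic = nmf.fit_transform(dtm)

# Assign each document to the topic with the largest weight
labels = topic.argmax(axis=1)

# Treat each basis vector (row of components_) as a centroid and
# compute each document's Euclidean distance to its own centroid
centroids = nmf.components_            # shape (n_components, n_features)
dists = np.linalg.norm(dtm - centroids[labels], axis=1)

for d, label, dist in zip(descs, labels, dists):
    print(f"{d!r} -> topic {label}, distance {dist:.3f}")
```

Note that the rows of `components_` live in the same space as the rows of `dtm` (one coordinate per vocabulary term), so the subtraction above is well defined.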

Upvotes: 2
