sandrosil

Reputation: 553

python/sklearn - how to get clusters and cluster names after doing kmeans

I have the following code, which runs k-means clustering after dimensionality reduction.

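import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Create CountVectorizer (recent scikit-learn versions expect stop_words as a list or 'english')
vec = CountVectorizer(token_pattern=r'[a-z-]+',
                      ngram_range=(1, 1), min_df=2, max_df=.8,
                      stop_words=list(ENGLISH_STOP_WORDS))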

cv = vec.fit_transform(X)
print('Dimensions: ', cv.shape) 

# Create LSA/TruncatedSVD with full dimensions
cv_lsa = TruncatedSVD(n_components=cv.shape[1]-1)
cv_lsa_data = cv_lsa.fit_transform(cv)

# Find dimensions with 80% variance explained
number = np.searchsorted(cv_lsa.explained_variance_ratio_.cumsum(), .8) + 1
print('Dimensions with 80% variance explained: ',number)

# Create LSA/TruncatedSVD with 80% variance explained
cv_lsa80 = TruncatedSVD(n_components=number)
cv_lsa_data80 = cv_lsa80.fit_transform(cv)

# Do KMeans with k=4 (the model must be fitted under the same name it was created with)
km = KMeans(n_clusters=4)
clustered = km.fit(cv_lsa_data80)

Now I'm stuck on what to do next. I want to get the clusters identified by the KMeans object and the 10 most commonly used words in each cluster. Something like:

Cluster 1:
1st most common word - count
2nd most common word - count

Cluster 2:
1st most common word - count
2nd most common word - count

Upvotes: 0

Views: 1731

Answers (1)

bhuvy

Reputation: 304

If you are looking for cluster center importance, the scikit-learn docs for KMeans say that a fitted model has an attribute cluster_centers_ of shape [n_clusters, n_features] that can help you out (here, n_features is the number of SVD components you kept).

km.fit(...)
cluster_centers = km.cluster_centers_
first_cluster = cluster_centers[0] # Cluster 1
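
To make the shapes concrete, here is a minimal sketch on made-up data. The array named reduced is a hypothetical stand-in for cv_lsa_data80 from the question, and n_init and random_state are set only so the run is repeatable. It also shows labels_, which tells you which documents ended up in each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the reduced matrix (cv_lsa_data80 in the question):
# 12 documents described by 3 SVD components.
rng = np.random.default_rng(0)
reduced = rng.normal(size=(12, 3))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(reduced)

print(km.cluster_centers_.shape)  # (n_clusters, n_components)
print(km.labels_)                 # cluster index assigned to each document

# Row indices of the documents assigned to each cluster
members = {c: np.flatnonzero(km.labels_ == c) for c in range(km.n_clusters)}
print(members)
```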

As an addendum, I don't think you'll be able to get the counts out directly, because you performed LSA with SVD on the dataset, so the features KMeans sees are SVD components rather than raw word counts. You'd have to find the most important SVD components of each cluster center by magnitude, then work out which words contribute most to those components. The components_ attribute of the fitted TruncatedSVD object, of shape [n_components, n_features], gives the weight of each word in each component.

Upvotes: 1
