Reputation: 553
So I have the following code, where I run k-means clustering after dimensionality reduction with LSA/TruncatedSVD.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Create CountVectorizer
vec = CountVectorizer(token_pattern=r'[a-z-]+',
                      ngram_range=(1, 1), min_df=2, max_df=.8,
                      stop_words=ENGLISH_STOP_WORDS)
cv = vec.fit_transform(X)
print('Dimensions: ', cv.shape)

# Create LSA/TruncatedSVD with full dimensions
cv_lsa = TruncatedSVD(n_components=cv.shape[1]-1)
cv_lsa_data = cv_lsa.fit_transform(cv)

# Find the number of dimensions that explain 80% of the variance
number = np.searchsorted(cv_lsa.explained_variance_ratio_.cumsum(), .8) + 1
print('Dimensions with 80% variance explained: ', number)

# Create LSA/TruncatedSVD with 80% variance explained
cv_lsa80 = TruncatedSVD(n_components=number)
cv_lsa_data80 = cv_lsa80.fit_transform(cv)

# Do KMeans with k=4
km = KMeans(n_clusters=4)
clustered = km.fit(cv_lsa_data80)
Now I'm stuck on what to do next. I want to take the clusters identified by the KMeans object and get the top 10 most commonly used words in each cluster, something like:
Cluster 1:
1st most common word - count
2nd most common word - count
Cluster 2:
1st most common word - count
2nd most common word - count
Upvotes: 0
Views: 1731
Reputation: 304
If you are looking for cluster center importance, the scikit-learn docs on KMeans say that there is an attribute cluster_centers_ of shape [n_clusters, n_features] that can help you out.
km.fit(...)
cluster_centers = km.cluster_centers_
first_cluster = cluster_centers[0] # Cluster 1
But as an addendum to that, I don't think you'll be able to get the counts out directly, because you performed LSA with SVD on the dataset, so the features k-means sees are no longer raw counts. You'd have to find the most important SVD components in each cluster center by magnitude, then map those components back to the vocabulary to see which words contribute most. The components_ attribute of the TruncatedSVD object gives you that mapping.
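A minimal sketch of that mapping, assuming the fitted vec, cv_lsa80, and km objects from the question (and get_feature_names_out, which is get_feature_names on older scikit-learn versions): multiplying each cluster center by components_ projects it back into term space, and sorting that vector gives the top terms per cluster. Keep in mind the printed numbers are SVD weights, not word counts.
import numpy as np

# Assumes vec (CountVectorizer), cv_lsa80 (TruncatedSVD) and km (KMeans)
# are the fitted objects from the question.
terms = vec.get_feature_names_out()          # vocabulary, aligned with components_

# Project each center from LSA space back into term space:
# (n_clusters, n_components) @ (n_components, n_features) -> (n_clusters, n_features)
term_weights = km.cluster_centers_ @ cv_lsa80.components_

for i, weights in enumerate(term_weights):
    top10 = np.argsort(weights)[::-1][:10]   # indices of the 10 largest weights
    print(f'Cluster {i + 1}:')
    for rank, idx in enumerate(top10, start=1):
        print(f'  {rank}. {terms[idx]} - {weights[idx]:.3f}')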
Upvotes: 1