Jim

Reputation: 23

Plotting KMeans Clustering of Text Data in Python

I have code that cleans some text data, vectorizes it with TfidfVectorizer, and runs it through a KMeans model. Everything is working, with the exception of actually plotting the clusters.

I am not totally understanding the output of TfidfVectorizer.

For example:

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['column 1'].values.astype('U'))

print(X)

(0, 36021)  0.17081171474660714

(0, 36020)  0.17081171474660714

(0, 36011)  0.13668653157547714

Can someone help me understand how to actually plot the clusters? I'm a little stuck on where to go from here. Or is there a better vectorizer to use with KMeans?

Also, when I look at the cluster centers I am seeing weird output: it ends up with a few thousand columns, like below. It's a relatively small dataset of about 3000 records of text.

print(kmeans.cluster_centers_)

[[8.71020045e-05 8.71020045e-05 8.71020045e-05 ... 1.34902052e-05
  1.34902052e-05 1.34902052e-05]

Here is some sample code for the clustering as recommended:

df = pd.read_csv('----------------.csv')

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['column 1'].values.astype('U'))

true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

model.fit(X)

print('Top Terms Per Cluster:')
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
for i in range(true_k):
    print('cluster %d' % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

print(model.cluster_centers_)
print(X)

Upvotes: 0

Views: 1047

Answers (1)

Russ

Reputation: 3771

TfidfVectorizer transforms each row of your data into a sparse vector of floats, where the dimension of the vector is equal to the size of the vocabulary determined by TfidfVectorizer (so you get a matrix that is n_docs x n_vocab). Typically the vocabulary will be much larger than the number of documents. KMeans computes cluster centers in this high-dimensional space. If you want to visualize these clusters in 2D or 3D, you need to apply some form of dimensionality reduction to both the Tfidf vectors and the KMeans centers. Since the Tfidf matrix is sparse, TruncatedSVD fitted on the Tfidf matrix is probably what you want.

Here's a toy example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = [
    "aa bb cc dd ee ff",
    "aa bb cc gg hh ii",
    "dd ee ff gg hh ii",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

tsvd = TruncatedSVD(n_components=2).fit(X)
projected_docs = tsvd.transform(X)
projected_centers = tsvd.transform(km.cluster_centers_)
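From there, a matplotlib scatter plot (assuming matplotlib is installed) can show the documents colored by cluster label with the projected centers marked on top. A minimal sketch, repeating the toy setup above for completeness and saving to a hypothetical `clusters.png`:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; use plt.show() instead for a window
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = [
    "aa bb cc dd ee ff",
    "aa bb cc gg hh ii",
    "dd ee ff gg hh ii",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1).fit(X)

# project both the documents and the cluster centers into 2D
tsvd = TruncatedSVD(n_components=2).fit(X)
projected_docs = tsvd.transform(X)
projected_centers = tsvd.transform(km.cluster_centers_)

# color each document by its assigned cluster, mark centers with an 'x'
plt.scatter(projected_docs[:, 0], projected_docs[:, 1], c=km.labels_)
plt.scatter(projected_centers[:, 0], projected_centers[:, 1],
            c='red', marker='x', s=100)
plt.savefig('clusters.png')
```

The same pattern applies to your real data: fit TruncatedSVD once on `X`, then transform both `X` and `model.cluster_centers_` with it, so documents and centers land in the same 2D coordinate system.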

Upvotes: 1
