Jim

Reputation: 23

Plotting KMeans Clustering of Text Data in Python

I have code that cleans some text data, vectorizes it with TfidfVectorizer, and runs it through a KMeans model. Everything is working, with the exception of actually plotting the clusters.

I am not totally understanding the output of TfidfVectorizer.

For example:

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['column 1'].values.astype('U'))

print(X)

(0, 36021)  0.17081171474660714

(0, 36020)  0.17081171474660714

(0, 36011)  0.13668653157547714

Can someone help me understand how to actually plot the clusters? I'm a little stuck on where to go from here. Or is there a better vectorizer to use with KMeans?

Also, when I look at the cluster centers I am seeing weird output: it ends up with a few thousand columns, like below. It's a relatively small dataset of about 3000 records of text.

print(kmeans.cluster_centers_)

[[8.71020045e-05 8.71020045e-05 8.71020045e-05 ... 1.34902052e-05
  1.34902052e-05 1.34902052e-05]

Here is some sample code for the clustering as recommended:

df = pd.read_csv('----------------.csv')

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['column 1'].values.astype('U'))

true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

model.fit(X)

print('Top Terms Per Cluster:')
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
for i in range(true_k):
    print('cluster %d' % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

print(model.cluster_centers_)
print(X)

Upvotes: 0

Views: 1047

Answers (1)

Russ

Reputation: 3771

TfidfVectorizer transforms each row of your data into a sparse vector of floats, where the dimension of the vector is equal to the size of the vocabulary determined by TfidfVectorizer (so you get a matrix that is n_docs x n_vocab). Typically the vocabulary will be much larger than the number of documents. KMeans computes cluster centers in this high-dimensional space. If you want to visualize these clusters in 2D or 3D, you need to apply some form of dimensionality reduction to both the Tfidf vectors and the KMeans centers. Since the Tfidf matrix is sparse, TruncatedSVD fitted on the Tfidf matrix is probably what you want.

Here's a toy example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = [
    "aa bb cc dd ee ff",
    "aa bb cc gg hh ii",
    "dd ee ff gg hh ii",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

tsvd = TruncatedSVD(n_components=2).fit(X)
projected_docs = tsvd.transform(X)
projected_centers = tsvd.transform(km.cluster_centers_)
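From there, a matplotlib scatter plot (assuming matplotlib is installed) can show the documents colored by cluster label with the projected centers marked on top. A minimal sketch, repeating the toy setup above for completeness and saving to a hypothetical `clusters.png`:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; use plt.show() instead for a window
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = [
    "aa bb cc dd ee ff",
    "aa bb cc gg hh ii",
    "dd ee ff gg hh ii",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1).fit(X)

# project both the documents and the cluster centers into 2D
tsvd = TruncatedSVD(n_components=2).fit(X)
projected_docs = tsvd.transform(X)
projected_centers = tsvd.transform(km.cluster_centers_)

# color each document by its assigned cluster, mark centers with an 'x'
plt.scatter(projected_docs[:, 0], projected_docs[:, 1], c=km.labels_)
plt.scatter(projected_centers[:, 0], projected_centers[:, 1],
            c='red', marker='x', s=100)
plt.savefig('clusters.png')
```

The same pattern applies to your real data: fit TruncatedSVD once on `X`, then transform both `X` and `model.cluster_centers_` with it, so documents and centers land in the same 2D coordinate system.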

Upvotes: 1
