Reputation: 23
I have code that cleans some text data, vectorizes it with TfidfVectorizer, and runs it through a KMeans model. Everything is working, with the exception of actually plotting the clusters.
I am not totally understanding the output of TfidfVectorizer.
For example:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['column 1'].values.astype('U'))
print(X)
(0, 36021) 0.17081171474660714
(0, 36020) 0.17081171474660714
(0, 36011) 0.13668653157547714
Can someone help me understand how to actually plot the clusters? I'm a little stuck on where to go from here. Or is there a better vectorizer to use with KMeans?
Also, when I look at the cluster centers I am seeing strange output: it ends up with a few thousand columns, like below. It's a relatively small dataset of about 3,000 records of text.
print(kmeans.cluster_centers_)
[[8.71020045e-05 8.71020045e-05 8.71020045e-05 ... 1.34902052e-05
1.34902052e-05 1.34902052e-05]
Here is some sample code for the clustering as recommended:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

df = pd.read_csv('----------------.csv')
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['column 1'].values.astype('U'))
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print('Top Terms Per Cluster:')
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print('cluster %d' % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
print(model.cluster_centers_)
print(X)
Upvotes: 0
Views: 1047
Reputation: 3771
TfidfVectorizer transforms each row of your data into a sparse vector of floats, where the dimension of the vector is equal to the size of the vocabulary determined by TfidfVectorizer (so you get a matrix that is n_docs x n_vocab). Typically the vocabulary will be much larger than the number of documents. KMeans computes cluster centers in this high-dimensional space. If you want to visualize these clusters in 2d or 3d, you need to apply some form of dimensionality reduction to both the Tfidf vectors and the KMeans centers. Since the Tfidf matrix is sparse, TruncatedSVD fitted on the Tfidf matrix is probably what you want.
Here's a toy example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
docs = [
    "aa bb cc dd ee ff",
    "aa bb cc gg hh ii",
    "dd ee ff gg hh ii",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
tsvd = TruncatedSVD(n_components=2).fit(X)
projected_docs = tsvd.transform(X)
projected_centers = tsvd.transform(km.cluster_centers_)
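From there, the actual plot is just two scatter calls. Here is a minimal self-contained sketch (assuming matplotlib is installed; it repeats the toy pipeline above so it runs on its own): color each projected document by its cluster label from `km.labels_`, and mark the projected centers separately.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "aa bb cc dd ee ff",
    "aa bb cc gg hh ii",
    "dd ee ff gg hh ii",
]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1).fit(X)

# Fit the SVD once on the Tfidf matrix, then use the SAME projection
# for both the documents and the cluster centers
tsvd = TruncatedSVD(n_components=2).fit(X)
projected_docs = tsvd.transform(X)        # shape (n_docs, 2)
projected_centers = tsvd.transform(km.cluster_centers_)  # shape (n_clusters, 2)

# Documents colored by cluster assignment; centers marked with an 'x'
plt.scatter(projected_docs[:, 0], projected_docs[:, 1], c=km.labels_, s=50)
plt.scatter(projected_centers[:, 0], projected_centers[:, 1],
            c='red', marker='x', s=200)
plt.xlabel('SVD component 1')
plt.ylabel('SVD component 2')
plt.show()
```

The key detail is fitting TruncatedSVD once and reusing that same fitted transformer for both `X` and `km.cluster_centers_`; fitting two separate SVDs would project them into different 2d spaces and the centers would land nowhere near their clusters.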
Upvotes: 1