Reputation: 103
I have done K-means clustering on text data:
#K-means clustering
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
%time km.fit(features)
clusters = km.labels_.tolist()
where features is the tf-idf matrix:
#preprocessing text - converting to a tf-idf vector form
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=0.01,max_df=0.75, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.keywrds).toarray()
labels = df.CD
Then I added the cluster labels to the original dataset:
df['clusters'] = clusters
And indexed the dataframe by clusters
df = df.set_index('clusters')
How do I fetch the top words for each cluster?
Upvotes: 1
Views: 1964
Reputation: 441
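This does not really give the top words in each cluster; it orders the words in each row by how frequent they are. You can then use the first few words as a word group instead of a cluster number.
built a dict with all feature names and their idf scores
from collections import defaultdict
featurenames = defaultdict(list)  # n-gram length -> list of (feature, idf)
for f, w in zip(tfidf.get_feature_names_out(), tfidf.idf_):  # get_feature_names() on scikit-learn < 1.0
    featurenames[len(f.split(' '))].append((f, w))
featurenames = dict(featurenames[1])  # keep single words (unigrams) only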
rounded off the feature idf values because they were a little long
featurenames = dict(zip(featurenames.keys(), [round(v, 4) for v in featurenames.values()]))
converted dict to df
dffeatures = pd.DataFrame.from_dict(featurenames, orient='index').reset_index() \
.rename(columns={'index': 'featurename',0:'featureid'})
dffeatures = dffeatures.round(4)
combined each feature word with its id and created a new dictionary. I did this to accommodate duplicate ids.
dffeatures['combined'] = dffeatures.apply(lambda x:'%s:%s' % (x['featureid'],x['featurename']),axis=1)
featurenamesnew = pd.Series(dffeatures.combined.values, index=dffeatures.featurename).to_dict()
{'cat': '2.3863:cat', 'cow': '3.0794:cow', 'dog': '2.674:dog'....}
created a new col in the df and replaced every word with its idf:feature value
df['temp'] = df['inputdata'].replace(featurenamesnew, regex=True)
ordered the idf:feature values in each row ascending, so the lowest-idf (most frequent) words appear first
df['temp'] = df['temp'].str.split().apply(lambda x: sorted(set(x), reverse=False)).str.join(' ').to_frame()
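To make the sorting step concrete, here is a small sketch on a single row. The token values are borrowed from the example dictionary above; the row itself is made up for illustration. Note that sorted() compares the tokens as strings, not as numbers.

```python
# One hypothetical row, already converted to idf:feature tokens.
tokens = "3.0794:cow 2.3863:cat 2.674:dog 2.3863:cat".split()

# set() drops the duplicate token; sorted() orders the rest as strings.
ordered = sorted(set(tokens))
print(ordered)

# Strip the numeric prefix to recover the words in that order.
words = [t.split(":", 1)[1] for t in ordered]
print(words)
```

Because this is a string sort, it agrees with numeric order only while every idf value has the same number of digits before the decimal point. A token such as '10.5:zebra' would sort ahead of '2.3863:cat'.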
reverse-mapped the idf:feature values back to the words
inv_map = {v: k for k, v in featurenamesnew.items()}
df['cluster_top_n_words'] = df['temp'].replace(inv_map, regex=True)
finally keep top n words in the new df col
df['cluster_top_n_words'] = df['cluster_top_n_words'].apply(lambda x: ' '.join(x.split()[:3]))
Upvotes: 1