uprav

Reputation: 103

Extract top words for each cluster

I ran K-means clustering on text data:

#K-means clustering
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
%time km.fit(features)
clusters = km.labels_.tolist()

where features is the TF-IDF matrix:

#preprocessing text - converting to a tf-idf vector form

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=0.01,max_df=0.75, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.keywrds).toarray()
labels = df.CD

Then I added the cluster labels to the original dataset:

df['clusters'] = clusters

And indexed the dataframe by cluster:

pd.DataFrame(df,index = [clusters])

How do I fetch the top words for each cluster?

Upvotes: 1

Views: 1964

Answers (1)

leo

Reputation: 441

This does not really give the top words of each cluster; it orders each row's words by how frequent they are across the corpus. You can then use the first few words as a word-group label instead of a cluster number.

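I built a dict of all feature names and their IDF scores, keeping only the single words:

from collections import defaultdict

featurenames = defaultdict(list)  # n-gram length -> [(feature, idf), ...]
# scikit-learn >= 1.0: get_feature_names_out(); older versions: get_feature_names()
for f, w in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    featurenames[len(f.split(' '))].append((f, w))
featurenames = dict(featurenames[1])  # keep unigrams only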

I rounded the IDF values because they were a little long:

featurenames = dict(zip(featurenames.keys(), [round(v, 4) for v in featurenames.values()]))

I converted the dict to a DataFrame:

dffeatures = pd.DataFrame.from_dict(featurenames, orient='index').reset_index() \
    .rename(columns={'index': 'featurename',0:'featureid'})
dffeatures = dffeatures.round(4)

I combined each feature word with its IDF value and created a new dictionary. I did this to handle duplicate IDF values, since different words can share the same score:

dffeatures['combined'] = dffeatures.apply(lambda x:'%s:%s' % (x['featureid'],x['featurename']),axis=1)
featurenamesnew = pd.Series(dffeatures.combined.values, index=dffeatures.featurename).to_dict()

{'cat': '2.3863:cat', 'cow': '3.0794:cow', 'dog': '2.674:dog'....}
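The DataFrame round trip above may not be strictly necessary. As a minimal sketch, assuming featurenames is already the rounded word-to-IDF dict (the three sample values below are taken from the output shown above), a dict comprehension produces a mapping of the same shape, and the reverse map used later follows directly:

```python
# Sample word -> IDF values, copied from the example output above.
featurenames = {"cat": 2.3863, "cow": 3.0794, "dog": 2.674}

# word -> "idf:word", the same shape as the mapping built via the DataFrame.
featurenamesnew = {name: f"{idf}:{name}" for name, idf in featurenames.items()}

# "idf:word" -> word, used later to map the tagged tokens back to words.
inv_map = {tagged: name for name, tagged in featurenamesnew.items()}

print(featurenamesnew)
```

Because the word is part of each combined string, two words with the same IDF still map to distinct values, so the reverse map loses nothing.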

I created a new column in the df and replaced every word with its idf:feature value:

df['temp'] = df['inputdata'].replace(featurenamesnew, regex=True)
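One thing to watch with replace(..., regex=True) is that the dict keys are treated as patterns, so a key such as cat can also match inside a longer word such as catalog. A token-wise replacement avoids that. This is only a sketch, assuming the text is whitespace-separated; tag_tokens is a hypothetical helper name and the mapping values are the sample ones from above:

```python
# Mapping in the same shape as featurenamesnew above (sample values).
featurenamesnew = {"cat": "2.3863:cat", "cow": "3.0794:cow", "dog": "2.674:dog"}


def tag_tokens(text, mapping):
    """Replace each whole token found in mapping; leave other tokens untouched."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


tagged = tag_tokens("cat catalog dog", featurenamesnew)
print(tagged)  # 2.3863:cat catalog 2.674:dog
```

On the dataframe this could be applied as df['inputdata'].apply(lambda s: tag_tokens(s, featurenamesnew)).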

I sorted each row's idf:feature values in ascending order so the most frequent words appear first:

df['temp'] = df['temp'].str.split().apply(lambda x: sorted(set(x), reverse=False)).str.join(' ').to_frame()
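The ascending sort puts frequent words first because a term's IDF falls as the number of documents containing it rises. The sketch below illustrates this with the smoothed formula that scikit-learn documents for the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The toy corpus and the helper name smoothed_idf are assumptions made for illustration:

```python
import math

# Hypothetical tokenized corpus: "cat" is in 4 documents, "dog" in 2, "cow" in 1.
docs = [["cat", "dog"], ["cat", "cow"], ["cat", "dog"], ["cat"]]
n_docs = len(docs)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


vocab = sorted({tok for doc in docs for tok in doc})
idf = {tok: round(smoothed_idf(tok), 4) for tok in vocab}

# Ascending IDF order puts the most widespread term first.
ranked = sorted(idf, key=idf.get)
print(ranked)  # ['cat', 'dog', 'cow']
```

Note that the answer sorts the idf:word strings rather than the numbers, so the order is lexicographic. That matches the numeric order only while every IDF has the same number of digits before the decimal point.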

I reverse-mapped the idf:feature values back to the words:

inv_map = {v: k for k, v in featurenamesnew.items()}
df['cluster_top_n_words'] = df['temp'].replace(inv_map, regex=True)

Finally, I kept the top n words (here 3) in the new df column:

df['cluster_top_n_words'] = df['cluster_top_n_words'].apply(lambda x: ' '.join(x.split()[:3]))

Upvotes: 1
