Implementing tf-idf in wordclouds

Question

I have some Google reviews for several universities in a dataframe like df_unis below. The column uni_name contains the university names. I want to create a separate word cloud for each university, such that a word appears larger in a university's word cloud not only when it appears more frequently in that university's reviews but also when it appears less frequently in the reviews of the other universities. I believe this is the idea behind tf-idf: moving from simple frequency to "importance." I'm after this importance, and the competitive insights that come from comparing the word clouds of different universities side by side.

I have written the code below, but I suspect that since tf-idf happens inside the loop, it is computed for each university's reviews separately, so I'm not achieving the goal above. Is that right? I tried moving tf-idf outside the loop, but then couldn't find a way to produce a separate word cloud for each university. Any advice is highly appreciated.

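import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

df_unis = pd.DataFrame({'review': ['this is a good school', 'this school was not worth it', 'aah', 'mediocre school with good profs'],
                        'uni_name': ['a', 'b', 'b', 'a']})

# Join each university's reviews into one document; joining with a space
# keeps the last word of one review from fusing with the first word of the next
corpus = df_unis.groupby('uni_name')['review'].agg(' '.join)

for group in corpus.index.unique():
    # The vectorizer is re-fit here on a single university's text in every iteration
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))

    vecs = vectorizer.fit_transform([corpus.loc[group]])
    feature_names = vectorizer.get_feature_names_out()
    dense = vecs.todense()
    lst = dense.tolist()
    tf_idf = pd.DataFrame(lst, columns=feature_names, index=[group])

    cloud = WordCloud().generate_from_frequencies(tf_idf.T.sum(axis=1))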

