Reputation: 2099
I have some Google reviews for several universities in a dataframe, df_unis, shown below. The column uni_name contains the university names. I want to create a separate word cloud for each university, in such a way that a word appears larger in a university's cloud if it not only appears more frequently in that university's reviews but also less frequently in the reviews of the other universities. I believe this is the idea behind tf-idf: moving from simple frequency to "importance." I'm after this importance, and the competitive insights that come from comparing the word clouds of different universities side by side.
I have written the code below, but I suspect that since tf-idf happens inside the loop, it is computed for each university's reviews separately and I'm not achieving my goal. Is that right? I tried moving tf-idf outside the loop, but then I couldn't find a way to produce a separate word cloud for each university. Any advice is highly appreciated.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

df_unis = pd.DataFrame({'review': ['this is a good school', 'this school was not worth it', 'aah', 'mediocre school with good profs'],
                        'uni_name': ['a', 'b', 'b', 'a']})
# Join each university's reviews into one document, separated by spaces
corpus = df_unis.groupby('uni_name')['review'].agg(' '.join)
for group in corpus.index.unique():
    # tf-idf is fitted here on a single university's text only
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
    vecs = vectorizer.fit_transform([corpus.loc[group]])
    feature_names = vectorizer.get_feature_names_out()
    dense = vecs.todense()
    lst = dense.tolist()
    tf_idf = pd.DataFrame(lst, columns=feature_names, index=[group])
    cloud = WordCloud().generate_from_frequencies(tf_idf.T.sum(axis=1))
Upvotes: 3
Views: 168
Reputation: 630
Your code currently calculates TF-IDF scores individually for each university, which means you're not comparing term importance across universities. Because the vectorizer is fitted on a single document in each iteration, every term gets the same inverse document frequency, so the scores reduce to normalized term counts. To make words that are more unique to each university stand out (i.e., terms that are frequent in one university’s reviews but rare in the others), calculate TF-IDF once across all universities, with one combined document per university, and then plot each university's row of scores.
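To make the point about fitting on one document concrete, here is a minimal pure-Python sketch of the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the formula scikit-learn documents for its default smooth_idf=True setting. The helper name smooth_idf and the two toy documents are assumptions made for illustration; tokenization is a plain whitespace split, not the vectorizer's own.

```python
import math
from collections import Counter

def smooth_idf(docs):
    """Smoothed idf per term: ln((1 + n) / (1 + df)) + 1, whitespace tokens."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

# Toy documents, one per university, roughly what is left after stop-word removal.
uni_a = "good school mediocre school good profs"
uni_b = "school worth aah"

idf_alone = smooth_idf([uni_a])         # fitted on one university only
idf_joint = smooth_idf([uni_a, uni_b])  # fitted on both universities together

print(idf_alone)  # every term gets the same weight
print(idf_joint)  # terms found in only one document get a higher weight
```

Fitted alone, every term has n = 1 and df = 1, so the idf factor is identical for all terms and cannot separate distinctive words from common ones. Fitted jointly, a term that occurs in both documents ("school") gets a lower weight than a term that occurs in only one.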
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Sample data
df_unis = pd.DataFrame({
'review': ['this is a good school', 'this school was not worth it', 'aah', 'mediocre school with good profs'],
'uni_name': ['a', 'b', 'b', 'a']
})
# Combine reviews for each university
corpus = df_unis.groupby('uni_name')['review'].apply(lambda x: " ".join(x))
# Calculate TF-IDF across the entire corpus
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(corpus)
# Get feature names (words) and transform TF-IDF matrix to DataFrame
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=corpus.index, columns=feature_names)
# Generate word clouds for each university using TF-IDF scores
for uni in tfidf_df.index:
    # Get TF-IDF scores for the specific university
    uni_tfidf = tfidf_df.loc[uni]
    word_freq = uni_tfidf[uni_tfidf > 0].sort_values(ascending=False)
    # Generate word cloud
    wordcloud = WordCloud().generate_from_frequencies(word_freq)
    # Display the word cloud
    plt.figure(figsize=(8, 6))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word Cloud for University '{uni}'")
    plt.show()
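One caveat: with the default smoothing, a term that occurs in every university's reviews is down-weighted but not zeroed out, so a word like "school" can still show up in every cloud. If you would rather hide such terms entirely, a rough sketch is to filter the per-university scores before plotting. The helper drop_shared_terms and the scores below are hypothetical and made up for illustration; they are not output from the code above.

```python
def drop_shared_terms(scores_by_uni):
    """Hypothetical helper: drop terms that score > 0 for every university."""
    # Terms with a positive score, collected per university
    term_sets = [
        {term for term, score in scores.items() if score > 0}
        for scores in scores_by_uni.values()
    ]
    # Terms present for all universities carry no contrast between them
    shared = set.intersection(*term_sets) if term_sets else set()
    return {
        uni: {t: s for t, s in scores.items() if s > 0 and t not in shared}
        for uni, scores in scores_by_uni.items()
    }

# Made-up scores for illustration only (not real TF-IDF output).
example = {
    "a": {"good": 0.6, "school": 0.4, "profs": 0.3, "aah": 0.0},
    "b": {"school": 0.5, "worth": 0.6, "aah": 0.6, "good": 0.0},
}
filtered = drop_shared_terms(example)
print(filtered)
```

With the tfidf_df from above, the input could be built as {uni: tfidf_df.loc[uni].to_dict() for uni in tfidf_df.index}, and each filtered dictionary passed to generate_from_frequencies. The max_df parameter of TfidfVectorizer is another way to exclude terms that occur in too many documents, if you prefer to do this at fitting time.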
Upvotes: 0