I have a corpus of text that needs to be analysed. I have a data frame with the below headers.
print((df.columns.values))
>>>> ['Unique ID' 'Quarter' 'Theme' 'Subtheme' 'Driver' 'Ticker' 'Company'
'Sub-sector' 'Issue weight' 'Quote' 'Executive name' 'Designation'
'Quote_len' 'word_count']
I have written a function to find the top 20 words in the 'Quote' column after removing stop words.
import pandas as pd
import cufflinks as cf  # assumption: the .iplot() call below is provided by cufflinks
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    # Build the vocabulary, ignoring English stop words
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Total occurrences of each term across all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['Quote'].values.astype('U'), 20)
for word, freq in common_words:
    print(word, freq)

df2 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df2.groupby('ReviewText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 words in review after removing stop words')
Now I wish to restrict this to rows matching a condition on the "Theme" column, similar to a SQL where clause.
For example, Theme = 'Competitive advantage'.
How can I do that?
Upvotes: 0
Views: 53
Reputation: 165
Use DataFrame.loc[...] to filter down your results.
For example: df = df.loc[df.Theme == 'Competitive advantage']
Then continue with: common_words = get_top_n_words(df['Quote'].values.astype('U'), 20)
The dataframe will now only include rows where Theme == 'Competitive advantage'.
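To make the filtering step concrete, here is a minimal sketch on a small made-up DataFrame. The helper name top_words_for_theme and the toy data are hypothetical and only for illustration. It keeps the filtered rows in a separate variable rather than overwriting df, so the full data stays available for other themes.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def get_top_n_words(corpus, n=None):
    # Same approach as in the question: count terms, ignoring English stop words
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]


def top_words_for_theme(df, theme, n=20):
    # Hypothetical helper: the boolean mask inside .loc acts like a where clause
    subset = df.loc[df['Theme'] == theme]
    return get_top_n_words(subset['Quote'].values.astype('U'), n)


# Toy data standing in for the real corpus (made up for this example)
toy_df = pd.DataFrame({
    'Theme': ['Competitive advantage', 'Competitive advantage', 'Pricing'],
    'Quote': [
        'Our brand moat keeps growing',
        'The moat protects our margins',
        'Pricing pressure eased this quarter',
    ],
})

top_words = top_words_for_theme(toy_df, 'Competitive advantage', n=3)
for word, freq in top_words:
    print(word, freq)
```

Note that if no rows match the theme, CountVectorizer has nothing to build a vocabulary from and raises a ValueError, so it may be worth checking that the subset is non-empty first.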
Upvotes: 1