Reputation: 37
I've made a list, important_words, and I have a dataframe with a column df['reviews'] that holds one string of review text per row (thousands of rows). I want to update 'reviews' by removing everything from each string that is not in the important_words list, the opposite of removing stop words, so that only the important_words remain for every review (row) in the df.
Later in my starter code I also tokenize and normalize df['reviews']. It seems like applying this to the tokenized column would be easier, since punctuation removal and lowercasing have already been applied there. I'll try whichever method someone can share, thanks.
important_words = ['actor', 'action', 'awesome']
df['reviews'][1] = 'The actor, in the action movie was awesome'
df['reviews'][2] = 'The action movie was not good'
....
df['tokenized_normalized_reviews'][1] = ['the', 'actor', 'in', 'the', 'action', 'movie', 'was', 'awesome']
df['tokenized_normalized_reviews'][2] = ['the', 'action', 'movie', 'was', 'not', 'good']
I want:
df['review_important_words'][1] = 'actor, action, awesome'
df['review_important_words'][2] = 'action'
(either as a str, or applied to the tokenized column)
Upvotes: 0
Views: 227
Reputation: 30906
df['reviews'] = df['reviews'].apply(lambda x: ' '.join([word for word in x.split() if word in (important_words)]))
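You can do it like this using pandas. apply runs the function on every element of the column. Note that x.split() splits on whitespace only, so a token such as 'actor,' keeps its comma and a capitalized word keeps its case; neither will match the list unless the text has already been lowercased and stripped of punctuation.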
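If the raw review text still contains punctuation and mixed case, a variation along these lines may be easier to work with. This is only a sketch: the helper name keep_important, the regex used for tokenizing, and the two-row sample dataframe are assumptions made for illustration, not part of the original code. It lowercases each review, extracts word tokens, and keeps only those found in a set built from important_words, writing the result to a new column rather than overwriting 'reviews'.

```python
import re
import pandas as pd

important_words = ["actor", "action", "awesome"]
# A set gives constant-time membership checks, which helps with thousands of rows.
important_set = set(important_words)

# Two-row sample standing in for the real dataframe (assumed for illustration).
df = pd.DataFrame({
    "reviews": [
        "The actor, in the action movie was awesome",
        "The action movie was not good",
    ]
})

def keep_important(text, vocab=important_set):
    """Return the important words found in text, in order, joined by ', '."""
    # Lowercase, then pull out runs of letters, digits, and apostrophes,
    # so that 'actor,' is tokenized as 'actor'.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return ", ".join(tok for tok in tokens if tok in vocab)

# String column in, string column out.
df["review_important_words"] = df["reviews"].apply(keep_important)

# The same filter applied to an already tokenized column (a list per row).
df["tokenized_normalized_reviews"] = df["reviews"].apply(
    lambda t: re.findall(r"[a-z0-9']+", t.lower())
)
df["important_tokens"] = df["tokenized_normalized_reviews"].apply(
    lambda toks: [tok for tok in toks if tok in important_set]
)

print(df[["review_important_words", "important_tokens"]])
```

If the tokenized column from the starter code already exists, only the last apply is needed, and the regex step can be dropped.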
Upvotes: 1