Gazoo

Reputation: 37

NLP preprocessing: remove all words in a string that are not in my list

I've made a list of important_words and I have a dataframe with a column df['reviews'] that holds one string of review text per row (thousands of rows). I want to update 'reviews' by removing every word that is not in the important_words list, the opposite of stop-word removal, so that each review (row) in the df is left with only the important_words.

Later in my starter code I also tokenize and normalize df['reviews']. Applying the filter to that tokenized column might be easier, since punctuation removal and lowercasing have already been applied there. I'll try whichever method someone can share, thanks.

important_words = ['actor', 'action', 'awesome']

df['reviews'][1] = 'The actor, in the action movie was awesome'
df['reviews'][2] = 'The action movie was not good'
....
df['tokenized_normalized_reviews'][1] = ['the', 'actor', 'in', 'the', 'action', 'movie', 'was', 'awesome']
df['tokenized_normalized_reviews'][2] = ['the', 'action', 'movie', 'was', 'not', 'good']

I want:
df['review_important_words'][1] = 'actor, action, awesome'
df['review_important_words'][2] = 'action'
(either as a string, or as the result of filtering the tokenized column)
 

Upvotes: 0

Views: 227

Answers (1)

user2736738

Reputation: 30906

df['reviews'] = df['reviews'].apply(lambda x: ' '.join([word for word in x.split() if word in (important_words)]))

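You can do it like this using pandas. apply runs the function on every element of the column, so each review is reduced to the words that appear in important_words. Note that this splits on whitespace and compares words exactly, so a word with punctuation attached or with different capitalization (for example 'actor,' or 'Action') will not match; run it on lowercased, punctuation-free text if that matters.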
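To get the output shown in the question from the raw review strings, the text has to be normalized before filtering. The following is a minimal sketch rather than a definitive implementation: the helper names keep_important and keep_important_tokens are hypothetical, the regular expression is just one simple way to drop punctuation, and the sample reviews are copied from the question. The word list is stored as a set so that membership tests stay fast over thousands of rows.

```python
import re

# Word list from the question, stored as a set for fast lookups.
important_words = {"actor", "action", "awesome"}


def keep_important(text, vocab=important_words):
    """Return the words of `text` that are in `vocab`, in their original order."""
    # Lowercase, then keep runs of letters, digits, and apostrophes,
    # so that "actor," is compared as "actor".
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(tok for tok in tokens if tok in vocab)


def keep_important_tokens(tokens, vocab=important_words):
    """Same filter for a list of words that is already tokenized and normalized."""
    return [tok for tok in tokens if tok in vocab]


# Sample reviews from the question.
reviews = [
    "The actor, in the action movie was awesome",
    "The action movie was not good",
]
filtered = [keep_important(review) for review in reviews]
```

With a dataframe, the same helpers can be passed to apply, for example df['review_important_words'] = df['reviews'].apply(keep_important), or df['tokenized_normalized_reviews'].apply(keep_important_tokens) for the tokenized column. Use ", ".join instead of " ".join if a comma-separated string is preferred.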

Upvotes: 1
