Reputation: 23
I'm writing a function that takes a dataframe (df) of tweets as input. I need to tokenize the tweets, remove the stop words, and store the result in a new column. I can't import anything except numpy and pandas.
The stop words are in a dictionary as follows:
stop_words_dict = {
'stopwords':[
'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon',
'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former',
'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through',
'seeming', 'hence', 'us', 'anywhere....}
This is what I attempted, a function to remove the stop words:
def stop_words_remover(df):
    stop_words = list(stop_words_dict.values())
    df["Without Stop Words"] = df["Tweets"].str.lower().str.split()
    df["Without Stop Words"] = df["Without Stop Words"].apply(lambda x: [word for word in x if word not in stop_words])
    return df
So if this was my input:
[@bongadlulane, please, send, an, email, to,]
This is the expected output:
[@bongadlulane, send, email, [email protected]]
But my function keeps returning the former instead of the latter.
Any insight would be really appreciated. Thank you
Upvotes: 0
Views: 1116
Reputation: 4618
Your problem is in this line:
stop_words = list(stop_words_dict.values())
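This returns a list whose only element is the list of stop words (a nested list). Each token is then compared against that whole inner list rather than against the individual words, so "word not in stop_words" is always true and nothing gets removed.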
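To see the difference, here is a small sketch. The three-word dictionary is a shortened, hypothetical stand-in for the full stop_words_dict in the question, and the variable names nested, flat, kept_nested and kept_flat are only for illustration.

```python
# Hypothetical, shortened stand-in for the stop_words_dict in the question.
stop_words_dict = {'stopwords': ['please', 'an', 'to']}

# list(dict.values()) wraps the value in another list.
nested = list(stop_words_dict.values())   # [['please', 'an', 'to']]
# Indexing by key gives the list of words itself.
flat = stop_words_dict['stopwords']       # ['please', 'an', 'to']

tokens = ['@bongadlulane', 'please', 'send', 'an', 'email', 'to']

# Each token is compared against the single inner list, so nothing matches.
kept_nested = [w for w in tokens if w not in nested]
# Each token is compared against the individual stop words.
kept_flat = [w for w in tokens if w not in flat]

print(kept_nested)  # all six tokens are still there
print(kept_flat)    # ['@bongadlulane', 'send', 'email']
```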
Replace it with:
stop_words = stop_words_dict['stopwords']
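Put together, the function might look like the sketch below. It assumes the column names from the question ("Tweets" and "Without Stop Words"); the short stop_words_dict and the sample tweet are made up for the example. Converting the list to a set is an optional change of mine, not part of the fix itself: it only makes the membership test faster on a long stop word list.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the stop_words_dict in the question.
stop_words_dict = {'stopwords': ['please', 'an', 'to']}

def stop_words_remover(df):
    # Take the list of words itself, not a list wrapping it.
    # The set() is optional and only speeds up the lookups.
    stop_words = set(stop_words_dict['stopwords'])
    # Lowercase each tweet and split it on whitespace.
    tokens = df["Tweets"].str.lower().str.split()
    # Keep only the tokens that are not stop words.
    df["Without Stop Words"] = tokens.apply(
        lambda words: [w for w in words if w not in stop_words]
    )
    return df

# Made-up sample tweet for illustration.
df = pd.DataFrame({"Tweets": ["@BongaDlulane Please send an email to"]})
result = stop_words_remover(df)
print(result["Without Stop Words"].iloc[0])  # ['@bongadlulane', 'send', 'email']
```

Note that, like the original, this adds the new column to the dataframe that was passed in and returns that same dataframe.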
Upvotes: 1