Reputation: 1047
I have a pandas DataFrame, raw_corpus, and am trying to remove stopwords using a user-defined stoplist (I edited the NLTK English stopwords file). Is something wrong with my stopwords file?
Here's the input pandas df raw_corpus:
Here's my code:
#my own custom stopwords list
stoplist="/User/dlhoffman/nltk_data/corpora/stopwords/english"
#filter out stopwords
raw_corpus['constructed_recipe'] = raw_corpus['constructed_recipe'].apply(lambda x: [item for item in x if
item not in stoplist])
#running the code below verifies empty dataframe
#raw_corpus['constructed_recipe'] = raw_corpus['constructed_recipe'].apply(lambda x: [])
Here's the result, obviously not what I'm looking for! What's wrong?
Upvotes: 0
Views: 3230
Reputation: 164773
pd.Series.apply with a generator expression should work:
import pandas as pd
df = pd.DataFrame([['this is the first test string'],
['this is yet another test'],
['this is a third test item'],
['this is the final test string']],
columns=['String'])
replace_set = {'this', 'is'}
df['String'] = df['String'].str.split(' ').apply(lambda x: ' '.join(k for k in x if k not in replace_set))
# df
# String
# 0 the first test string
# 1 yet another test
# 2 a third test item
# 3 the final test string
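One thing worth noting about the question's code: stoplist there is the file path itself (a string), so `item not in stoplist` tests whether item is a substring of that path, not whether it is one of the words in the file. Below is a minimal sketch of reading the file into a set first. It assumes the file holds one word per line and that the column holds lists of tokens; the temporary file, the sample rows, and the helper name load_stoplist are hypothetical stand-ins, not taken from the question.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_stoplist(path):
    """Read a one-word-per-line stopword file into a set (hypothetical helper)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


# Stand-in for the edited stopwords file: a temporary file, one word per line.
with tempfile.TemporaryDirectory() as tmp:
    stop_path = Path(tmp) / "english"
    stop_path.write_text("this\nis\nthe\na\n", encoding="utf-8")
    stop_words = load_stoplist(stop_path)

# Stand-in for raw_corpus: each cell is assumed to be a list of tokens.
raw_corpus = pd.DataFrame(
    {"constructed_recipe": [["this", "is", "the", "flour"], ["a", "pinch", "salt"]]}
)

# Filter against the set of words, not against the path string.
raw_corpus["constructed_recipe"] = raw_corpus["constructed_recipe"].apply(
    lambda tokens: [t for t in tokens if t not in stop_words]
)
```

If the cells are plain strings instead of lists, they would need to be split into words first, as in the example above.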
Explanation
pd.Series.str.split(' ') splits each string on single spaces, returning a series of lists in which each list item is a word.
pd.Series.apply accepts a lambda (anonymous) function as input and calls it on each item in the series, effectively in a loop.
(k for k in x if k not in replace_set) is a generator expression that yields each word k from x that passes the if condition.
' '.join consumes the generator expression and joins the remaining words into a single string.
Upvotes: 1