profhoff

Reputation: 1047

Remove stopwords from a pandas df with a user-supplied list

I have a raw_corpus and am trying to delete stopwords with a user-defined stoplist (I edited the nltk english stopwords file). Something must be wrong with my stopwords file?

Here's the input pandas df raw_corpus:

raw_corpus

Here's my code:

#my own custom stopwords list
stoplist="/User/dlhoffman/nltk_data/corpora/stopwords/english"
#filter out stopwords
raw_corpus['constructed_recipe'] = raw_corpus['constructed_recipe'].apply(lambda x: [item for item in x if 
item not in stoplist])
#running the code below verifies empty dataframe
#raw_corpus['constructed_recipe'] = raw_corpus['constructed_recipe'].apply(lambda x: [])

Here's the result, which is obviously not what I'm looking for. What's wrong?

output

Upvotes: 0

Views: 3230

Answers (1)

jpp

Reputation: 164773

pd.Series.apply with a generator expression should work:

import pandas as pd

df = pd.DataFrame([['this is the first test string'],
                   ['this is yet another test'],
                   ['this is a third test item'],
                   ['this is the final test string']],
                  columns=['String'])

replace_set = {'this', 'is'}

df['String'] = df['String'].str.split(' ').apply(lambda x: ' '.join(k for k in x if k not in replace_set))

# df
#                     String
# 0    the first test string
# 1         yet another test
# 2        a third test item
# 3    the final test string

Explanation

  • pd.Series.str.split(' ') splits each string on single spaces, returning a series of lists, with each list item a word.
  • pd.Series.apply accepts a lambda (anonymous) function as an input, effectively applying a function to each item in the series in a loop.
  • The generator expression (k for k in x if k not in replace_set) yields each word k from the list x, skipping any word that is in replace_set.
  • ' '.join is used on the generator expression to form a string from the generated words.
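
As for what goes wrong in the question's code: stoplist there is assigned the file path as a string, so `item not in stoplist` is a substring test against that path and never looks at the words in the file. Below is a minimal sketch of reading the file into a set first and then applying the same approach. It assumes the stopwords file has one word per line and that the column holds plain strings; the temporary file and the sample rows are hypothetical stand-ins, since the real data is not shown.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the edited stopwords file (one word per line).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("this\nis\nthe\na\n")
    stoplist_path = fh.name

# Read the file's contents into a set; the path string itself is not a stoplist.
with open(stoplist_path, encoding="utf-8") as fh:
    stoplist = {line.strip() for line in fh if line.strip()}
os.remove(stoplist_path)

# Hypothetical sample data; the real raw_corpus is assumed to hold plain strings.
raw_corpus = pd.DataFrame(
    {"constructed_recipe": ["this is the first test string",
                            "this is a third test item"]}
)

# Split each string into words, drop the stopwords, and join the rest back up.
raw_corpus["constructed_recipe"] = (
    raw_corpus["constructed_recipe"]
    .str.split()
    .apply(lambda words: " ".join(w for w in words if w not in stoplist))
)

print(raw_corpus["constructed_recipe"].tolist())
# ['first test string', 'third test item']
```

If the column already holds lists of tokens rather than strings, the .str.split() step can be dropped and the lambda applied directly.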

Upvotes: 1
