Reputation: 5097
I have a dataframe df which contains uncleaned text strings:
phrase
0 the quick brown br fox
1 jack and jill went up the hill
I also have a list of words and letter groupings that I'd like to remove, called remove, which looks like:
['br', 'and']
In this example I'd like the following output:
phrase
0 the quick brown fox
1 jack jill went up the hill
Note that the br in 'brown' remains in df, as it's part of a larger word, but the 'br' on its own is removed.
I've tried:
df['phrase']=[re.sub(r"\b%remove\b", "", sent) for sent in df['phrase']]
But can't get it to work correctly. What can I try next?
Upvotes: 1
Views: 49
Reputation: 323266
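I feel like it can be done with str.replace, using the remove list from the question:
# wrap each entry in word boundaries so only whole words match
s=[r'\b'+x+r'\b' for x in remove]
# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df.phrase.str.replace('|'.join(s),'',regex=True)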
Out[176]:
0 the quick brown fox
1 jack jill went up the hill
Name: phrase, dtype: object
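One caveat with the replace approach is that deleting a word leaves its surrounding spaces behind, and entries containing regex metacharacters would be read as pattern syntax. A minimal self-contained sketch that handles both is below. It assumes the sample data from the question; the names pattern and cleaned are illustrative, not from the original answer.

```python
import re

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"phrase": ["the quick brown br fox",
                              "jack and jill went up the hill"]})
remove = ["br", "and"]

# Escape each entry so regex metacharacters are matched literally,
# then wrap the whole alternation in word boundaries.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in remove) + r")\b"

cleaned = (
    df["phrase"]
    .str.replace(pattern, "", regex=True)   # drop the unwanted whole words
    .str.replace(r"\s+", " ", regex=True)   # collapse the leftover double spaces
    .str.strip()                            # trim spaces left at either end
)
df["phrase"] = cleaned
print(df)
```

The 'br' inside 'brown' is untouched because there is no word boundary between 'br' and 'own'.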
Upvotes: 1
Reputation: 862681
Use a nested list comprehension with split, test membership with in, and join the split values back:
L = ['br', 'and']
df['phrase']=[' '.join(x for x in sent.split() if x not in L) for sent in df['phrase']]
print(df)
phrase
0 the quick brown fox
1 jack jill went up the hill
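The same idea can be sketched on plain strings, which makes the behaviour easy to check without a dataframe. The helper name drop_words is hypothetical, and a set is used instead of a list only because membership tests are faster on sets. Note that this matches whole whitespace-separated tokens exactly, so a token with attached punctuation such as 'and,' would not be removed.

```python
def drop_words(sentence, unwanted):
    """Return sentence with every whitespace-separated token in unwanted removed."""
    return " ".join(word for word in sentence.split() if word not in unwanted)


remove = {"br", "and"}
phrases = ["the quick brown br fox", "jack and jill went up the hill"]

cleaned = [drop_words(p, remove) for p in phrases]
print(cleaned)

# With a dataframe, the same helper could be applied as:
# df['phrase'] = [drop_words(sent, remove) for sent in df['phrase']]
```

Because split() with no arguments discards runs of whitespace, the result has single spaces between words without any extra cleanup step.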
Upvotes: 3