Stacey

Reputation: 5097

Removing a list of letter groupings and words from a dataframe populated with sentences

I have a dataframe df which contains uncleaned text strings:

                             phrase
 0           the quick brown br fox
 1   jack and jill went up the hill

I also have a list of words and letter groupings that I'd like to remove, called remove, which looks like:

['br', 'and']

In this example I'd like the following output:

                         phrase
 0          the quick brown fox
 1   jack jill went up the hill

Note that the 'br' in 'brown' remains in df because it's part of a larger word, but the 'br' on its own is removed.

I've tried:

df['phrase']=[re.sub(r"\b%remove\b", "", sent) for sent in df['phrase']]

But can't get it to work correctly. What can I try next?

Upvotes: 1

Views: 49

Answers (2)

BENY

Reputation: 323266

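I feel like it can be done with str.replace:

L = ['br', 'and']
s = [r'\b' + x + r'\b' for x in L]

# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
df.phrase.str.replace('|'.join(s), '', regex=True)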
Out[176]: 
0           the quick brown  fox
1    jack  jill went up the hill
Name: phrase, dtype: object
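
Note the doubled spaces left behind where a word was removed. A minimal sketch of one way to tidy that up, using only the standard re module on a plain list that stands in for df['phrase'] (the names phrases, pattern and cleaned are illustrative, not from the question). It also escapes each entry with re.escape, on the assumption that the removal list might contain regex metacharacters:

```python
import re

# Stand-in for df['phrase']; with pandas, assign the result back via
# df['phrase'] = [...] exactly as in the question's list comprehension.
phrases = ["the quick brown br fox", "jack and jill went up the hill"]
remove = ["br", "and"]

# Escape every entry so it is matched literally, then wrap the whole
# alternation in \b so only standalone words match ('br' in 'brown' stays).
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, remove)) + r")\b")

# Substitute the matches away, then split/join to collapse leftover spaces.
cleaned = [" ".join(pattern.sub("", s).split()) for s in phrases]
print(cleaned)
```

The split/join step is what turns 'brown  fox' back into 'brown fox'; without it the output keeps the gap shown above.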

Upvotes: 1

jezrael

Reputation: 862681

Use a nested list comprehension with split, test membership with in, and join the split values back:

L = ['br', 'and']

df['phrase']=[' '.join(x for x in sent.split() if x not in L) for sent in df['phrase']]
print (df)
                       phrase
0         the quick brown fox
1  jack jill went up the hill
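
One thing to keep in mind with the split approach is that tokens are compared exactly, so a word with punctuation attached is not removed. A small sketch of that behaviour on plain strings (drop_tokens and unwanted are hypothetical names introduced here; a set is used instead of a list only to make the membership test cheaper):

```python
def drop_tokens(sentence, unwanted):
    """Keep only the whitespace-separated tokens that are not in `unwanted`."""
    return " ".join(tok for tok in sentence.split() if tok not in unwanted)

unwanted = {"br", "and"}  # set membership is a constant-time lookup on average
phrases = ["the quick brown br fox", "jack and jill went up the hill"]

cleaned = [drop_tokens(s, unwanted) for s in phrases]
print(cleaned)

# 'and,' is a different token from 'and', so it is left in place.
print(drop_tokens("salt and, pepper", unwanted))
```

If the text contains punctuation like this, a word-boundary regex is probably the better fit; for plain space-separated words the split version is simpler and leaves no extra spaces.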

Upvotes: 3
