Remove all the words except in list

Question

I have a pandas DataFrame like the one below. It contains a column of sentences, and I also have a list called vocab. I want to remove every word from each sentence except the words that are in the vocab list.

Example df:

                                 sentence
0  packag come differ what about tomorrow
1        Hello dear truth is hard to tell

Example vocab:

['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

Expected output:

                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell

I first tried to use .str.replace to remove all the vocab words from sentence and store the result in t1. Then I did the same thing again with t1 and sentence, so that I would get my expected output. But it is not working as expected.

My attempt:

import pandas as pd

vocab_lis=['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
vocab_regex = ' '+' | '.join(vocab_lis)+' '
df=pd.DataFrame()
s = pd.Series(["packag come differ what about tomorrow", "Hello dear truth is hard to tell"])
df['sentence']=s
df['sentence']= ' '+df['sentence']+' '

df['t1'] = df['sentence'].str.replace(vocab_regex, ' ')
df['t2'] = df.apply(lambda x: pd.Series(x['sentence']).str.replace(' | '.join(x['t1'].split()), ' '), axis=1)

Is there a simple way to achieve this? I know my code is not working because of the spaces. How can I solve this?

jezrael · Accepted Answer

Use a list comprehension with a nested generator expression, splitting each sentence on whitespace:

df['res'] = [' '.join(y for y in x.split() if y in vocab_lis) for x in df['sentence']]
print(df)
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
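
With a large vocabulary, the test "y in vocab_lis" scans the whole list for every word. Converting the list to a set first makes each lookup constant time on average. The following is only a minimal sketch of that idea: the helper name keep_vocab_words is hypothetical (it does not appear in the answer), and it is written in plain Python so it runs without pandas.

```python
def keep_vocab_words(sentence, vocab):
    """Return the words of `sentence` that appear in `vocab`, in original order."""
    # split() with no arguments splits on any run of whitespace and ignores
    # leading/trailing spaces, so the padded sentences from the question work too.
    return ' '.join(word for word in sentence.split() if word in vocab)


vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
vocab_set = set(vocab_lis)  # set membership is O(1) on average

sentences = [
    "packag come differ what about tomorrow",
    " Hello dear truth is hard to tell ",  # padded, as in the question
]
results = [keep_vocab_words(s, vocab_set) for s in sentences]
print(results)
```

Applied to the DataFrame from the question, the same helper would presumably be used as df['res'] = [keep_vocab_words(x, vocab_set) for x in df['sentence']].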

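To fix the regex from the question, use word boundaries (\b) instead of spaces. Note that this removes the vocab words, so t1 holds the words that are not in the list:

vocab_regex = '|'.join(r"\b{}\b".format(x) for x in vocab_lis)
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['t1'] = df['sentence'].str.replace(vocab_regex, '', regex=True)
print(df)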
                                 sentence                  t1
0  packag come differ what about tomorrow   come  what about 
1        Hello dear truth is hard to tell     Hello   is  to
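
The regex above deletes the vocab words, leaving the words that should have been dropped. If a regex-based route is preferred, one option is to invert it and extract the matches instead of replacing them. This is a sketch under assumptions, not part of the original answer: the helper name extract_vocab_words is hypothetical, re.escape is added as a precaution in case a vocab entry contains regex metacharacters, and plain Python is used so it runs without pandas.

```python
import re

vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

# One alternation wrapped in word boundaries. The non-capturing group (?:...)
# makes findall return the full matched word rather than a group.
pattern = re.compile(
    r"\b(?:{})\b".format('|'.join(re.escape(w) for w in vocab_lis))
)


def extract_vocab_words(sentence):
    """Return the vocab words found in `sentence`, in order of appearance."""
    return ' '.join(pattern.findall(sentence))


sentences = [
    "packag come differ what about tomorrow",
    "Hello dear truth is hard to tell",
]
extracted = [extract_vocab_words(s) for s in sentences]
print(extracted)
```

In pandas, the equivalent would likely be df['sentence'].str.findall(pattern.pattern).str.join(' '). One caveat: \b also matches next to punctuation, so a token such as hard-won would still yield hard, whereas the split-based approach compares whole whitespace-separated tokens.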
