Mohamed Thasin ah

Reputation: 11192

Remove all the words except in list

I have a pandas DataFrame like the one below. It contains sentences, and I also have a list called vocab. I want to remove every word from each sentence except the words that are in the vocab list.

Example df:

                                 sentence
0  packag come differ what about tomorrow
1        Hello dear truth is hard to tell

Example vocab:

['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

Expected O/P:

                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell

I first tried using .str.replace to remove all the important words (the vocab words) from sentence and stored the result in t1. Then I did the same thing with t1 and sentence, so that I would get my expected output. But it's not working as expected.

My attempt:

vocab_lis=['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
vocab_regex = ' '+' | '.join(vocab_lis)+' '
df=pd.DataFrame()
s = pd.Series(["packag come differ what about tomorrow", "Hello dear truth is hard to tell"])
df['sentence']=s
df['sentence']= ' '+df['sentence']+' '

df['t1'] = df['sentence'].str.replace(vocab_regex, ' ')
df['t2'] = df.apply(lambda x: pd.Series(x['sentence']).str.replace(' | '.join(x['t1'].split()), ' '), axis=1)

Is there a simple way to achieve this? I know that my code is not working because of the spaces. How can I solve this?

Upvotes: 1

Views: 1405

Answers (2)

iamklaus

Reputation: 3770

Using np.array:

data

                                   sentence
0    packag come differ what about tomorrow
1          Hello dear truth is hard to tell

Vocab

v = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

First split the sentence into a list of words, then use np.in1d to check which of those words are in the vocab list. Then join the matching words back into a string.

# write to a new 'res' column so the original sentence is kept, as in the output below
data['res'] = data['sentence'].apply(lambda x: ' '.join(np.array(x.split(' '))[np.in1d(x.split(' '), v)]))

Output

                                   sentence                     res
0    packag come differ what about tomorrow  packag differ tomorrow
1          Hello dear truth is hard to tell    dear truth hard tell
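
For reference, here is a self-contained sketch of the same idea that can be pasted and run as is. It assumes a NumPy version that provides np.isin (the newer counterpart of np.in1d), and the helper name keep_vocab is made up for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

v = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
data = pd.DataFrame({'sentence': ["packag come differ what about tomorrow",
                                  "Hello dear truth is hard to tell"]})

def keep_vocab(sentence, vocab):
    # hypothetical helper: keep only the words of `sentence` found in `vocab`
    words = np.array(sentence.split())
    if words.size == 0:
        # an empty sentence has nothing to filter
        return ''
    # boolean mask: True where the word is in the vocab list
    mask = np.isin(words, vocab)
    return ' '.join(words[mask])

data['res'] = data['sentence'].apply(lambda s: keep_vocab(s, v))
print(data)
```

The sentence is split only once here, and the word order of the original sentence is preserved.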

Upvotes: 2

jezrael

Reputation: 862406

Use a list comprehension with a nested generator expression, splitting on whitespace:

df['res'] = [' '.join(y for y in x.split() if y in vocab_lis) for x in df['sentence']]
print (df)
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
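
If the vocab list is large, the membership test `y in vocab_lis` scans the whole list for every word. A minimal sketch of one possible variation, assuming the vocab words are hashable strings, converts the list to a set first (the name vocab_set is introduced here only for illustration):

```python
import pandas as pd

vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
df = pd.DataFrame({'sentence': ["packag come differ what about tomorrow",
                                "Hello dear truth is hard to tell"]})

# a set gives constant-time average lookups instead of scanning the list
vocab_set = set(vocab_lis)

# keep each word of the sentence only if it is in the vocab, in original order
df['res'] = [' '.join(w for w in s.split() if w in vocab_set)
             for s in df['sentence']]
print(df)
```

For a vocab as small as the one in the question, the difference is negligible.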

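# regex variant with word boundaries instead of spaces; this removes the vocab words (the t1 step)
vocab_regex = '|'.join(r"\b{}\b".format(x) for x in vocab_lis)
# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal by default
df['t1'] = df['sentence'].str.replace(vocab_regex, '', regex=True)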
print (df)
                                 sentence                  t1
0  packag come differ what about tomorrow   come  what about 
1        Hello dear truth is hard to tell     Hello   is  to
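
The same word-boundary pattern can also be used to keep the vocab words directly instead of removing them. The following is a sketch only, assuming the vocab entries are plain words; re.escape is added as a precaution in case an entry contains regex metacharacters.

```python
import re
import pandas as pd

vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
df = pd.DataFrame({'sentence': ["packag come differ what about tomorrow",
                                "Hello dear truth is hard to tell"]})

# one alternation of all vocab words, wrapped in word boundaries
pattern = r"\b(?:{})\b".format('|'.join(re.escape(w) for w in vocab_lis))

# findall returns the list of matches per row; join turns it back into a string
df['res'] = df['sentence'].str.findall(pattern).str.join(' ')
print(df)
```

This avoids the second replace pass from the question, since the matches are collected in sentence order in a single step.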

Upvotes: 2
