Reputation: 11192
I have a pandas DataFrame like the one below. It contains sentences, and I also have a list called vocab. I want to remove every word from each sentence except the words that are in the vocab list.
Example df:
                                 sentence
0  packag come differ what about tomorrow
1        Hello dear truth is hard to tell
Example vocab:
['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
Expected O/P:
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
I first tried to use .str.replace to remove all the vocab words from sentence and store the result in t1. I then did the same thing with t1 and sentence, so that I would get my expected output. But it is not working as expected.
My attempt:
import pandas as pd

vocab_lis=['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
vocab_regex = ' '+' | '.join(vocab_lis)+' '
df=pd.DataFrame()
s = pd.Series(["packag come differ what about tomorrow", "Hello dear truth is hard to tell"])
df['sentence']=s
df['sentence']= ' '+df['sentence']+' '
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
df['t1'] = df['sentence'].str.replace(vocab_regex, ' ', regex=True)
df['t2'] = df.apply(lambda x: pd.Series(x['sentence']).str.replace(' | '.join(x['t1'].split()), ' ', regex=True), axis=1)
Is there a simple way to achieve this? I know that my code is not working because of the spaces. How can I solve this?
Upvotes: 1
Views: 1405
Reputation: 3770
Using np.array
Data
                                 sentence
0  packag come differ what about tomorrow
1        Hello dear truth is hard to tell
Vocab
v = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
First split the sentence into a list of words, then use np.isin (np.in1d in older NumPy releases, where it does the same job) to mark the words that also appear in v. Finally, join the kept words back into a string and store it in a new res column:
data['res'] = data['sentence'].apply(lambda x: ' '.join(np.array(x.split(' '))[np.isin(x.split(' '), v)]))
Output
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
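For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same masking idea. The helper name keep_vocab is only illustrative, and np.isin is used as the membership test; treat it as one possible arrangement rather than the only way to do it.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "sentence": [
        "packag come differ what about tomorrow",
        "Hello dear truth is hard to tell",
    ]
})
v = ["packag", "differ", "tomorrow", "dear", "truth", "hard", "tell"]

def keep_vocab(sentence, vocab):
    # keep_vocab is a hypothetical helper name, not part of pandas or NumPy
    words = np.array(sentence.split())
    mask = np.isin(words, vocab)   # True where the word is in the vocab
    return " ".join(words[mask])   # boolean indexing preserves word order

data["res"] = data["sentence"].apply(lambda s: keep_vocab(s, v))
print(data)
```

Splitting once and reusing the array avoids calling split twice per row, which the one-liner does.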
Upvotes: 2
Reputation: 862406
Use nested list comprehension with split by whitespace:
df['res'] = [' '.join(y for y in x.split() if y in vocab_lis) for x in df['sentence']]
print (df)
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
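If the vocab is large, membership tests against a list get slow because each lookup scans the list. A small variation, sketched here with an assumed name vocab_set, converts the list to a set first; the output should be the same.

```python
import pandas as pd

vocab_lis = ["packag", "differ", "tomorrow", "dear", "truth", "hard", "tell"]
vocab_set = set(vocab_lis)  # assumed name; set lookups are O(1) on average

df = pd.DataFrame({
    "sentence": [
        "packag come differ what about tomorrow",
        "Hello dear truth is hard to tell",
    ]
})

# Same nested comprehension, only the membership test changes
df["res"] = [
    " ".join(word for word in sentence.split() if word in vocab_set)
    for sentence in df["sentence"]
]
print(df)
```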
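Your .str.replace attempt can also work if the pattern uses word boundaries (\b) instead of surrounding spaces. This removes the vocab words and leaves the remaining words in t1:
vocab_regex = '|'.join(r"\b{}\b".format(x) for x in vocab_lis)
df['t1'] = df['sentence'].str.replace(vocab_regex, '', regex=True)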
print (df)
                                 sentence               t1
0  packag come differ what about tomorrow  come what about
1        Hello dear truth is hard to tell      Hello is to
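The space-delimited pattern from the question fails because each match consumes its trailing space, so a vocab word that directly follows another one no longer has the leading space it needs. Word boundaries match without consuming characters. As a sketch of how the same \b pattern could produce res directly, the following uses str.findall to extract the vocab words and str.join to glue them back together; the variable names spaced and pattern are only illustrative.

```python
import re
import pandas as pd

vocab_lis = ["packag", "differ", "tomorrow", "dear", "truth", "hard", "tell"]
df = pd.DataFrame({
    "sentence": [
        "packag come differ what about tomorrow",
        "Hello dear truth is hard to tell",
    ]
})

# The question's pattern: ' dear ' swallows the space that ' truth ' needs
spaced = " " + " | ".join(vocab_lis) + " "
leftover = re.sub(spaced, " ", " Hello dear truth is hard to tell ")
print(repr(leftover))  # 'truth' is still present

# Word boundaries are zero-width, so adjacent vocab words all match
pattern = r"\b(?:" + "|".join(re.escape(w) for w in vocab_lis) + r")\b"
df["res"] = df["sentence"].str.findall(pattern).str.join(" ")
print(df)
```

re.escape is not strictly needed for this vocab, but it keeps the pattern safe if a vocab entry ever contains a regex metacharacter.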
Upvotes: 2