Remove uncommon words in columns of two dataframes in pandas

Question

I have two dataframes: df1 and df2.

df1 looks like this:

id   text
1    I love this car
2    I hate this car
3    Cars are life
4    Bikers are also good

df2 looks like this:

id   text
1    I love this supercar
2    I hate cars
3    Cars are love
4    Bikers are nice

Now, I want to keep only those words that are in df1 and df2.

The word car is in df1 but not in df2, so I want to remove it.

The word life is in df1 but not in df2, so I want to remove it.

The word also is in df1 but not in df2, so I want to remove it.

The word good is in df1 but not in df2, so I want to remove it.

The word supercar is in df2 but not in df1, so I want to remove it.

The word nice is in df2 but not in df1, so I want to remove it.

Expected output of df1:

id   text
1    I love this
2    I hate this
3    Cars are
4    Bikers are

Expected output of df2

id   text
1    I love this
2    I hate cars
3    Cars are love
4    Bikers are

jezrael · Accepted Answer

Create intersection of words in both columns and then remove not matched values:

a = set([y for x in df1['text'] for y in x.split()])
b = set([y for x in df2['text'] for y in x.split()])
c = a & b
print (c)
{'hate', 'are', 'Bikers', 'this', 'love', 'I', 'Cars'}

df1['text'] = df1['text'].apply(lambda x: ' '.join(y for y in x.split() if y in c))
df2['text'] = df2['text'].apply(lambda x: ' '.join(y for y in x.split() if y in c))
print (df1)
   id         text
0   1  I love this
1   2  I hate this
2   3     Cars are
3   4   Bikers are

print (df2)
   id           text
0   1    I love this
1   2         I hate
2   3  Cars are love
3   4     Bikers are

Remove uncommon words in columns of two dataframes in pandas

Answers (1)

Related Questions