My80

Reputation: 169

How to get only different words from two pandas.DataFrame columns

I have a DataFrame with columns id, keywords1 and keywords2. For each row, I would like to get only the words from keywords2 that are not in keywords1. I also need to clean the resulting column of meaningless words such as phph or wfgh; I'm only interested in English words.

Example:

import pandas as pd

data = [[1, 'detergent', 'detergent for cleaning stains'],
        [2, 'battery charger', 'wwfgh, old, glass'],
        [3, 'sunglasses, black, metal', 'glass gggg jik xxx,'],
        [4, 'chemicals, flammable', 'chemicals, phph']]

df = pd.DataFrame(data, columns=['id', 'keywords1', 'keywords2'])

df

Upvotes: 0

Views: 165

Answers (2)

Renaud

Reputation: 2819

Let's try:

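def words_diff(words1, words2):
    # With axis=1, each argument is a plain string (one cell), so use str.split directly
    kw1 = words1.split()
    kw2 = words2.split()
    # Keep the words of keywords2 that do not appear in keywords1
    diff = [x for x in kw2 if x not in kw1]
    return diff


df['diff'] = df.apply(lambda x: words_diff(x['keywords1'], x['keywords2']), axis=1)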
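
Splitting on whitespace alone leaves punctuation attached to the words (for example 'wwfgh,' in row 2). A minimal sketch of a variant that extracts letter-only tokens and keeps the order of keywords2 is shown below; the helpers tokenize and ordered_diff are hypothetical names introduced for illustration and are not part of pandas.

```python
import re

import pandas as pd

data = [
    [1, 'detergent', 'detergent for cleaning stains'],
    [2, 'battery charger', 'wwfgh, old, glass'],
    [3, 'sunglasses, black, metal', 'glass gggg jik xxx,'],
    [4, 'chemicals, flammable', 'chemicals, phph'],
]
df = pd.DataFrame(data, columns=['id', 'keywords1', 'keywords2'])


def tokenize(text):
    # Hypothetical helper: lowercase the text and extract runs of ASCII letters,
    # so commas and other punctuation never end up inside a token.
    return re.findall(r"[a-z]+", text.lower())


def ordered_diff(words1, words2):
    # Hypothetical helper: words of words2 that are absent from words1,
    # in their original order and without repeats.
    seen = set(tokenize(words1))
    result = []
    for word in tokenize(words2):
        if word not in seen:
            result.append(word)
            seen.add(word)
    return result


df['diff'] = [ordered_diff(a, b) for a, b in zip(df['keywords1'], df['keywords2'])]
```

This assumes that case-insensitive matching and ASCII-only words are acceptable for the data at hand.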

Upvotes: 0

Georgina Skibinski

Reputation: 13387

Try:

import numpy as np

# Split on every single character that is neither a word character nor "+",
# then turn each list of tokens into a set

df["keywords1"] = df["keywords1"].str.split(r"[^\w+]").map(set)

df["keywords2"] = df["keywords2"].str.split(r"[^\w+]").map(set)

# (keywords1 xor keywords2) and keywords2 leaves the words found only in keywords2
df["keywords3"] = np.bitwise_and(np.bitwise_xor(df["keywords1"], df["keywords2"]), df["keywords2"])
# Optional: if you wish to keep it as a string rather than a set:
df["keywords3"] = df["keywords3"].str.join(", ")

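Output (the empty entry in row 1 is an empty token, produced because ", " is split at both the comma and the space):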

   id  ...              keywords3
0   1  ...  cleaning, for, stains
1   2  ...    , wwfgh, glass, old
2   3  ...  jik, xxx, glass, gggg
3   4  ...                   phph
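
The result above still contains meaningless tokens such as phph or gggg, which the question also asks to remove. One possible approach, sketched here under assumptions, is to keep only words found in a known vocabulary. ENGLISH_WORDS is a tiny stand-in set invented for this example and keep_english is a hypothetical helper; in practice the vocabulary would come from a real English word list.

```python
import pandas as pd

# Assumption: a tiny stand-in vocabulary for illustration only;
# replace it with a real English word list.
ENGLISH_WORDS = {"for", "cleaning", "stains", "old", "glass", "detergent"}


def keep_english(words, vocabulary=ENGLISH_WORDS):
    # Hypothetical helper: keep only the words present in the vocabulary,
    # compared case-insensitively, preserving their order.
    return [w for w in words if w.lower() in vocabulary]


# Per-row word differences, written out by hand to keep the example self-contained
diffs = pd.Series([
    ["for", "cleaning", "stains"],
    ["wwfgh", "old", "glass"],
    ["glass", "gggg", "jik", "xxx"],
    ["phph"],
])

english_only = diffs.map(keep_english)
```

The quality of the filtering depends entirely on the word list: a small list drops valid words, while a very large one may let odd tokens through.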

Upvotes: 1
