Reputation: 169
I have a DataFrame with columns id, keywords1 and keywords2. I would like to get only the words from column keywords2 that are not in column keywords1. I also need to clean this new column of meaningless words like phph, wfgh... since I'm only interested in English words.
Example:
import pandas as pd

data = [[1, 'detergent', 'detergent for cleaning stains'],
        [2, 'battery charger', 'wwfgh, old, glass'],
        [3, 'sunglasses, black, metal', 'glass gggg jik xxx,'],
        [4, 'chemicals, flammable', 'chemicals, phph']]
df = pd.DataFrame(data, columns=['id', 'keywords1', 'keywords2'])
df
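This gives:
   id                 keywords1                      keywords2
0   1                 detergent  detergent for cleaning stains
1   2           battery charger              wwfgh, old, glass
2   3  sunglasses, black, metal            glass gggg jik xxx,
3   4      chemicals, flammable                chemicals, phph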
Upvotes: 0
Views: 165
Reputation: 2819
Let's try:
def words_diff(words1, words2):
    # split both keyword strings on whitespace
    kw1 = words1.split()
    kw2 = words2.split()
    # keep the words from keywords2 that do not appear in keywords1
    diff = [x for x in kw2 if x not in kw1]
    return diff

df['diff'] = df.apply(lambda x: words_diff(x['keywords1'], x['keywords2']), axis=1)
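Note that splitting on whitespace keeps punctuation attached to the tokens (e.g. 'wwfgh,' instead of 'wwfgh'). A possible variation, just a sketch (the name words_diff_clean is my own), splits on runs of non-word characters instead:
import re

def words_diff_clean(words1, words2):
    # split on runs of non-word characters so trailing commas are stripped
    kw1 = re.split(r"\W+", words1)
    kw2 = re.split(r"\W+", words2)
    # drop empty tokens and keep words that only appear in keywords2
    return [x for x in kw2 if x and x not in kw1]

df['diff'] = df.apply(lambda x: words_diff_clean(x['keywords1'], x['keywords2']), axis=1)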
Upvotes: 0
Reputation: 13387
Try:
import numpy as np
# split to get words - on every non-word character - and turn each cell into a set
# (splitting happens on single characters, so consecutive separators leave an empty string behind)
df["keywords1"] = df["keywords1"].str.split(r"[^\w+]").map(set)
df["keywords2"] = df["keywords2"].str.split(r"[^\w+]").map(set)
# symmetric difference intersected with keywords2 = words that appear only in keywords2
df["keywords3"] = np.bitwise_and(np.bitwise_xor(df["keywords1"], df["keywords2"]), df["keywords2"])
# optional - if you wish to keep it as a string, and not a set:
df["keywords3"] = df["keywords3"].str.join(", ")
Outputs:
   id  ...              keywords3
0   1  ...  cleaning, for, stains
1   2  ...    , wwfgh, glass, old
2   3  ...  jik, xxx, glass, gggg
3   4  ...                   phph
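The question also asks to keep only English words; a possible sketch for that, assuming NLTK's words corpus and applying it before the optional join step (while keywords3 still holds sets):
import nltk
from nltk.corpus import words
# nltk.download('words')  # one-time download of the English word list
english = set(words.words())
# keep only tokens that appear in the English word list
df["keywords3"] = df["keywords3"].map(lambda ws: {w for w in ws if w.lower() in english})
Gibberish tokens like phph or wwfgh should then be dropped, while real words such as glass and old remain (the corpus doesn't contain every inflected form, so a larger word list may work better).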
Upvotes: 1