Reputation: 382
I have two columns that is a combination of comma separated words and single words in a string format. col1
will always only contain one word. In this example I will use the word Dog as the word to have in col1
, but this will differ in the real data, so please do not make a solution that uses regex on Dog specifically.
df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog"],
"col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse"]})
I want to check if the word in col1
appears in the string in col2
, and if it does, I want to remove that word from col2
. But keep in mind that I want to keep the rest of the string if there are more words left. So it will go from this:
col1 col2
0 Dog Cat, Mouse
1 Dog Dog
2 Dog Cat
3 Dog Dog, Mouse
To this:
col1 col2
0 Dog Cat, Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
Upvotes: 1
Views: 274
Reputation: 28644
Try this:
import re
df['col2'] = [(re.sub(fr"({word}[\s,]*)","",sentence))
for word,sentence in zip(df.col1,df.col2)]
df
col1 col2
0 Dog Cat, Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
another df, with dog in the middle :
df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog","Dog"],
"col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse", "Cat, Dog, Mouse"]})
df
col1 col2
0 Dog Cat, Mouse
1 Dog Dog
2 Dog Cat
3 Dog Dog, Mouse
4 Dog Cat, Dog, Mouse
Apply the code above :
col1 col2
0 Dog Cat, Mouse
1 Dog
2 Dog Cat
3 Dog Mouse
4 Dog Cat, Mouse
Upvotes: 3
Reputation: 26676
l=df.col1.tolist()
#list of col1
Create set from col2
, evaluate membership of l
in set by finding difference applying lambda function.
df['col2']=list(zip(df.col2))
df['col2']=df.col2.apply(lambda x:[*{*x}-{*l}]).str[0]
Upvotes: 1
Reputation: 8302
(^,|,$)
to handle starting & trailing comma
(,\s|,)
will remove comma those getting retained after replace operation.
{1,}
to skip non-repeated comma
df['col2'] = df['col2'].str. \
replace("|".join(df['col1'].unique()), "").str.strip() \
.str.replace("(?:^,|,$)", "") \
.str.replace("(?:,\s|,){1,}", ",")
col1 col2
0 Dog Cat,Mouse
1 Dog
2 Dog Cat
3 Dog Mouse,Mouse
Upvotes: 2