torkestativ
torkestativ

Reputation: 382

Remove word within a string based on another columns value

I have two columns that is a combination of comma separated words and single words in a string format. col1 will always only contain one word. In this example I will use the word Dog as the word to have in col1, but this will differ in the real data, so please do not make a solution that uses regex on Dog specifically.

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog"],
                     "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse"]})

I want to check if the word in col1 appears in the string in col2, and if it does, I want to remove that word from col2. But keep in mind that I want to keep the rest of the string if there are more words left. So it will go from this:

    col1    col2    
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse

To this:

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse

Upvotes: 1

Views: 274

Answers (3)

sammywemmy
sammywemmy

Reputation: 28644

Try this:

import re
df['col2'] = [(re.sub(fr"({word}[\s,]*)","",sentence)) 
             for word,sentence in zip(df.col1,df.col2)]
df

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse

another df, with dog in the middle :

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog","Dog"],
                     "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse", "Cat, Dog, Mouse"]})

df


   col1     col2
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse
4   Dog     Cat, Dog, Mouse

Apply the code above :

   col1     col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse
4   Dog     Cat, Mouse

Upvotes: 3

wwnde
wwnde

Reputation: 26676

l=df.col1.tolist()#list of col1

Create set from col2, evaluate membership of l in set by finding difference applying lambda function.

df['col2']=list(zip(df.col2))
df['col2']=df.col2.apply(lambda x:[*{*x}-{*l}]).str[0]

enter image description here

Upvotes: 1

sushanth
sushanth

Reputation: 8302

(^,|,$) to handle starting & trailing comma
(,\s|,) will remove comma those getting retained after replace operation.
{1,} to skip non-repeated comma

df['col2'] = df['col2'].str. \
    replace("|".join(df['col1'].unique()), "").str.strip() \
    .str.replace("(?:^,|,$)", "") \
    .str.replace("(?:,\s|,){1,}", ",")

  col1          col2
0  Dog     Cat,Mouse
1  Dog              
2  Dog           Cat
3  Dog   Mouse,Mouse

Upvotes: 2

Related Questions