pandini

Reputation: 69

Remove word in a list from two columns if the word is in both columns in Python

Input data:

import pandas as pd

data = {'A': ['word1 other stuff', 'otherstuff word1', 'hello word3 bye', 'word1s more stuff'],
        'B': ['foo word1', 'word2 hello', 'word2 bye', 'some str word1']
}

df = pd.DataFrame(data, columns=['A', 'B'])

I'm trying to remove a word from two strings only if that word appears as a standalone word (e.g. word1) in the same row of both column A and column B. In the example provided, word1 should be removed from the first row only. NOTE: word1 should not be removed from the 4th row, because column A there contains "word1s...", not word1 as a standalone word.

OLD CODE:

import re

mywordslist = ["word1", "word2", "word3"]

for word in mywordslist:
    if (word in df['A']) and (word in df['B']):
        df['word_removed'] = 1  # indicator: both A and B contained the word
        df['A_new'] = df['A'].apply(lambda x: re.sub(r'\b{}\b'.format(re.escape(word)), ' ', x))
        df['A_new'] = df['A_new'].apply(lambda x: re.sub(r'\b{}$'.format(re.escape(word)), ' ', x))
        df['A_new'] = df['A_new'].apply(lambda x: re.sub(r'^{}\b'.format(re.escape(word)), ' ', x))
    else:
        print('word not in both A and B')

So in the example, word1 should be removed from the first row since word1 is in the first row of both A and B.

The code runs, but words that should be removed are left in place in many rows, and the indicator never shows that a word is in both strings. How should I correctly specify the if statement?
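One likely reason the condition never fires: `in` applied to a pandas Series tests the index labels (0, 1, 2, ...), not the cell values, so `word in df['A']` is not a row-wise text check. The sketch below illustrates this on the example data and shows a row-wise alternative with `Series.str.contains`. The names `label_hit`, `value_hit`, `mask_a`, `mask_b` and `both` are illustrative, not part of the original code.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'A': ['word1 other stuff', 'otherstuff word1', 'hello word3 bye', 'word1s more stuff'],
    'B': ['foo word1', 'word2 hello', 'word2 bye', 'some str word1'],
})

# `in` on a Series checks the index labels, not the values.
label_hit = 0 in df['A']        # 0 is a row label
value_hit = 'word1' in df['A']  # 'word1' is not a row label

# Row-wise test: does each cell contain the standalone word?
pattern = r'\b' + re.escape('word1') + r'\b'
mask_a = df['A'].str.contains(pattern, regex=True)
mask_b = df['B'].str.contains(pattern, regex=True)
both = mask_a & mask_b  # True only where A and B both contain the word

print(label_hit, value_hit)
print(both.tolist())
```

Here `both` is a boolean Series with one entry per row, so it can serve directly as the indicator column or as a mask for selecting rows to modify.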

Using the code suggested by @IoaTzimas, word1 is also removed in cases like row 4, where A contains word1s and B contains word1. In that row both A and B should be kept as they are, without removing word1:

### Adopting @IoaTzimas suggestion:

def fA(x, y):
    for k in mywordslist:
        if k in x and k in y:
            x = re.sub(r'\b{}\b'.format(re.escape(k)), ' ', x)
    return x

def fB(x, y):
    for k in mywordslist:
        if k in x and k in y:
            y = re.sub(r'\b{}\b'.format(re.escape(k)), ' ', y)
    return y

df['newA'] = df.apply(lambda x: fA(x.A, x.B), axis=1)
df['newB'] = df.apply(lambda x: fB(x.A, x.B), axis=1)

NEW QUESTION: How can I remove word1 only in row1 where word1 is standalone? (i.e. I want to only remove r'\bk\b' and not any occurrence of k in combination with other characters).

EDIT: This seems to solve the issue:

mywordslist = ["word1", "word2", "word3"]

def fA(x, y):
    for k in mywordslist:
        pattern = r'\b' + re.escape(k) + r'\b'
        if re.search(pattern, x) and re.search(pattern, y):
            # Substitute with word boundaries so substrings such as "word1s" stay intact.
            x = re.sub(pattern, ' ', x)
            ## x = re.sub(r'\b{}$'.format(re.escape(k)), ' ', x)  # if the word should only be removed from the end of the string
    return x

def fB(x, y):
    for k in mywordslist:
        pattern = r'\b' + re.escape(k) + r'\b'
        if re.search(pattern, x) and re.search(pattern, y):
            y = re.sub(pattern, ' ', y)
    return y

df['newA'] = df.apply(lambda x: fA(x.A, x.B), axis=1)
df['newB'] = df.apply(lambda x: fB(x.A, x.B), axis=1)
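The two functions can also be folded into a single pass per row that returns both cleaned strings plus the indicator mentioned earlier. This is a minimal sketch, not a definitive implementation: `remove_shared_words` is a hypothetical helper name, and collapsing the leftover whitespace is an added assumption about the desired output.

```python
import re
import pandas as pd

mywordslist = ["word1", "word2", "word3"]

def remove_shared_words(a, b, words):
    """Return (new_a, new_b, removed) for one row (hypothetical helper)."""
    removed = 0
    for k in words:
        pat = re.compile(r'\b' + re.escape(k) + r'\b')
        # Act only when the standalone word occurs in both strings.
        if pat.search(a) and pat.search(b):
            a = pat.sub(' ', a)
            b = pat.sub(' ', b)
            removed = 1
    if removed:
        # Assumption: tidy up the spaces left behind by the removals.
        a = ' '.join(a.split())
        b = ' '.join(b.split())
    return a, b, removed

df = pd.DataFrame({
    'A': ['word1 other stuff', 'otherstuff word1', 'hello word3 bye', 'word1s more stuff'],
    'B': ['foo word1', 'word2 hello', 'word2 bye', 'some str word1'],
})

# result_type='expand' turns each returned tuple into separate columns.
result = df.apply(lambda r: remove_shared_words(r.A, r.B, mywordslist),
                  axis=1, result_type='expand')
result.columns = ['newA', 'newB', 'word_removed']
df = df.join(result)
print(df)
```

Doing the search and the substitution with the same compiled pattern keeps the test and the removal consistent, so a row is never flagged without being changed or changed without being flagged.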

Upvotes: 1

Views: 178

Answers (1)

IoaTzimas

Reputation: 10624

Here is my suggestion:

Your input data:

import pandas as pd

data = {'A': ['word1 other stuff', 'otherstuff word1', 'hello word3 bye'],
        'B': ['foo word1', 'word2 hello', 'word2 bye']
}

df = pd.DataFrame(data, columns=['A', 'B'])

print(df)
                   A            B
0  word1 other stuff    foo word1
1   otherstuff word1  word2 hello
2    hello word3 bye    word2 bye

mywordslist = ["word1", "word2", "word3"]

Solution:

def fA(x, y):
    for k in mywordslist:
        # Compare against whole space-separated tokens, not substrings.
        if k in x.split(' ') and k in y.split(' '):
            x = x.replace(k, '')
    return x

def fB(x, y):
    for k in mywordslist:
        if k in x.split(' ') and k in y.split(' '):
            y = y.replace(k, '')
    return y

df['newA'] = df.apply(lambda x: fA(x.A, x.B), axis=1)
df['newB'] = df.apply(lambda x: fB(x.A, x.B), axis=1)

del df['A']
del df['B']
df=df.rename(columns={'newA':'A', 'newB':'B'})

print(df)

Output:

                  A            B
0       other stuff         foo 
1  otherstuff word1  word2 hello
2   hello word3 bye    word2 bye
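One caveat worth noting: the `split(' ')` test checks for a whole token, but `str.replace` then removes every occurrence of the substring, so a string holding both word1 and word1s would lose part of word1s as well. A small sketch of the difference, where `drop_token` is a hypothetical helper that filters whole tokens instead:

```python
def drop_token(text, word):
    """Remove only space-separated tokens equal to `word` (hypothetical helper)."""
    return ' '.join(tok for tok in text.split(' ') if tok != word)

text = 'word1 and word1s'

by_replace = text.replace('word1', '')  # also cuts into 'word1s'
by_token = drop_token(text, 'word1')    # leaves 'word1s' untouched

print(repr(by_replace))  # ' and s'
print(repr(by_token))    # 'and word1s'
```

Token filtering treats only spaces as separators, so a word followed by punctuation (for example "word1,") would not match; a regex with `\b` boundaries is the usual alternative in that case.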

Upvotes: 1
