wingsoficarus116
wingsoficarus116

Reputation: 449

Generate list of regex expressions to replace list of words in a dataframe python

I'm trying add new column headers to a dataframe which may conflict with existing dataframe headers.

I would like to use df.replace(to_replace, replacements, regex=True) to replace any whole words that match existing df columns with a word from the replacements list.

As an example:

If current df.columns = ['Test 1', 'Test 2']

I would like to replace all strings of 'Test 1' in the df with 'Test 1_Test 1'

If I use:

df.replace(r'\bTest 1\b', 'Test 1_Test1', regex=True) 

I can replace just the instances of 'Test 1', however I would like to generalize this to be able to replace all instances of the column headers with a different word.

My understanding of the root cause is I am not generating a list of regex expressions correctly and would like to know how.

Currently I am using:

replace_list = list( r'\b' + pd.Series(list(df.columns)) + '\b')

However this does not work.

Any and all help appreciated.

Upvotes: 1

Views: 782

Answers (3)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627103

Since the variable you pass to the regex pattern seems to look like word1|word2|more_here, you may use

r'(?:\b|(?<=_))(?:' + pd.Series(list(df.columns)) + r')(?=\b|_)'

This pattern will match one of the alternatives in between word boundaries or underscores. The reason for the alternations inside the lookbehind and lookahead is that a word boundary by itself requires a non word char on either side of the search word, and that means underscores, being word characters, would fail the match.

Upvotes: 1

Shijo
Shijo

Reputation: 9711

If I understood correctly you want to replace the all the values in a column if it matches the header label.

import pandas as pd
import numpy as np
df=pd.DataFrame([['aa','bb','cc','dd','ee'],
                 ['aa','Test2','cc','dd','ee'],
                ['aa','bb','cc','Test4','ee'],
                ['aa','bb','cc','dd','ee']])
df.columns=['Test1','Test2','Test3','Test4','Test5']

print df

cols=pd.Series(df.columns)


for col in cols:
    replacevalue=col+'_'+col
    df[col] =df[col].replace(col,replacevalue)
print df

input

  Test1  Test2 Test3  Test4 Test5
0    aa     bb    cc     dd    ee
1    aa  Test2    cc     dd    ee
2    aa     bb    cc  Test4    ee
3    aa     bb    cc     dd    ee

output

  Test1        Test2 Test3        Test4 Test5
0    aa           bb    cc           dd    ee
1    aa  Test2_Test2    cc           dd    ee
2    aa           bb    cc  Test4_Test4    ee
3    aa           bb    cc           dd    ee

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210902

Try to think differently of this task - it's much easier and much much faster to rename the original column names:

In [228]: df
Out[228]:
                       Test 1                      Test 2
0              Will this work  Test 2: Is this even legit
1  Test 1: Maybe it will work                        nope
2   It probably will not work                        nope

In [229]: df.add_prefix('~~').add_suffix('~~')
Out[229]:
                   ~~Test 1~~                  ~~Test 2~~
0              Will this work  Test 2: Is this even legit
1  Test 1: Maybe it will work                        nope
2   It probably will not work                        nope

Upvotes: 0

Related Questions