Reputation: 449
I'm trying add new column headers to a dataframe which may conflict with existing dataframe headers.
I would like to use df.replace(to_replace, replacements, regex=True)
to replace any whole words that match existing df columns with a word from the replacements list.
As an example:
If current df.columns = ['Test 1', 'Test 2']
I would like to replace all strings of 'Test 1' in the df with 'Test 1_Test 1'
If I use:
df.replace(r'\bTest 1\b', 'Test 1_Test1', regex=True)
I can replace just the instances of 'Test 1', however I would like to generalize this to be able to replace all instances of the column headers with a different word.
My understanding of the root cause is I am not generating a list of regex expressions correctly and would like to know how.
Currently I am using:
replace_list = list( r'\b' + pd.Series(list(df.columns)) + '\b')
However this does not work.
Any and all help appreciated.
Upvotes: 1
Views: 782
Reputation: 627103
Since the variable you pass to the regex pattern seems to look like word1|word2|more_here
, you may use
r'(?:\b|(?<=_))(?:' + pd.Series(list(df.columns)) + r')(?=\b|_)'
This pattern will match one of the alternatives in between word boundaries or underscores. The reason for the alternations inside the lookbehind and lookahead is that a word boundary by itself requires a non word char on either side of the search word, and that means underscores, being word characters, would fail the match.
Upvotes: 1
Reputation: 9711
If I understood correctly you want to replace the all the values in a column if it matches the header label.
import pandas as pd
import numpy as np
df=pd.DataFrame([['aa','bb','cc','dd','ee'],
['aa','Test2','cc','dd','ee'],
['aa','bb','cc','Test4','ee'],
['aa','bb','cc','dd','ee']])
df.columns=['Test1','Test2','Test3','Test4','Test5']
print df
cols=pd.Series(df.columns)
for col in cols:
replacevalue=col+'_'+col
df[col] =df[col].replace(col,replacevalue)
print df
input
Test1 Test2 Test3 Test4 Test5
0 aa bb cc dd ee
1 aa Test2 cc dd ee
2 aa bb cc Test4 ee
3 aa bb cc dd ee
output
Test1 Test2 Test3 Test4 Test5
0 aa bb cc dd ee
1 aa Test2_Test2 cc dd ee
2 aa bb cc Test4_Test4 ee
3 aa bb cc dd ee
Upvotes: 0
Reputation: 210902
Try to think differently of this task - it's much easier and much much faster to rename the original column names:
In [228]: df
Out[228]:
Test 1 Test 2
0 Will this work Test 2: Is this even legit
1 Test 1: Maybe it will work nope
2 It probably will not work nope
In [229]: df.add_prefix('~~').add_suffix('~~')
Out[229]:
~~Test 1~~ ~~Test 2~~
0 Will this work Test 2: Is this even legit
1 Test 1: Maybe it will work nope
2 It probably will not work nope
Upvotes: 0