SiDoesCoding
SiDoesCoding

Reputation: 111

Pandas DataFrame - replace substring in column if a substring exists

I am trying to update DataFrame column names if a substring exists within the column name, however I only need to update the substring, retaining all information either side of it.

Example:

import pandas as pd

lookup = [('abc_param_one_new', 'param_one'), 
          ('abc_param_two_new', 'param_two'), 
          ('abc_param_three_new', 'param_three')]

d = {'p1_abc_param_ONE_new1': [1,2],
     'p2_abc_param_one_new2': [45,3],
    'p3_abc_Param_One_new3': [76,5],
    'p4_abc_param_two_new1': [4321,6],
    'p5_abc_param_Two_new2': [3,2],
    'p6_abc_param_THREE_new1': [6,5]}

df = pd.DataFrame(d)

# make all lowercase 
df.columns = [x.lower() for x in df.columns]

I am trying to use the lookup var (this could be a different format though) to search and replace the substring within each column name.

e.g.

'p1_abc_param_ONE_new1' -> 'p1_param_one1'

I can do this primitively looping over the lookup and the checking for each column header, but can't manage to replace the substring.

Additionally it feels that the nested loop approach is not very efficient. Is there a more pythonic way to do this perhaps using built in pandas methods and/or list comprehensions?

Upvotes: 0

Views: 1095

Answers (2)

Shubham Sharma
Shubham Sharma

Reputation: 71689

You can use series.replace with replacement dictionary

repl = {fr'(?i){k}': v for k, v in lookup}
df.columns = df.columns.to_series().replace(repl, regex=True)

   p1_param_one1  p2_param_one2  p3_param_one3  p4_param_two1  p5_param_two2  p6_param_three1
0              1             45             76           4321              3                6
1              2              3              5              6              2                5

Upvotes: 2

mozway
mozway

Reputation: 260570

You can use a regex and str.replace:

dic = dict(lookup)
pat = '|'.join(dic)
df.columns = df.columns.str.replace(pat, lambda x: dic.get(x.group(0)), regex=True)

output (df.columns):

Index(['p1_param_one1', 'p2_param_one2', 'p3_param_one3', 'p4_param_two1',
       'p5_param_two2', 'p6_param_three1'],
      dtype='object')

Upvotes: 1

Related Questions