How to truncate the beginning of all columns in a pandas data frame?

Question

I have a pandas data frame with quite a few different columns (4800+) and they are all named after file directories. For example, I can have two columns named:

Return.ReturnData.IRS990.EOYAmount and Return.ReturnData.IRS990PF.EOYAmount as you can see the only difference is at IRS990/IRS990PF. But these two columns contain similar data so I want tor rename them both to EOYAmount.

Obviously I have a couple thousand files so I can't do this for every single one but I wanted to know how to truncate everything before the last period? If anybody has any suggestions please let me know. Thanks!

Phil Leh · Accepted Answer

You could use the following:

import re

rename_pattern = r'.*\.' # will remove everything until 
                         # the last . in name (greedy regex, longest match)

# applies the lambda mapper 
# function to all columns and replaces the rename_pattern with an empty string
df = df.rename(columns=lambda col_name: re.sub(rename_pattern,'',col_name))

Edit:

To keep only a part of the column names you can alter the above given code as follows:

import re

rename_pattern = r'(.*?\.){2}' 

df = df.rename(columns=lambda col_name: re.sub(rename_pattern,'',col_name))

The ( ) in the rename_pattern groups the regular expression and allows to specify the occurence quantity with the trailing {2}. If your path is longer and you have to delete a longer pattern you can alter this 2 with any number you like. Another change is to make .* matching non-greedy with ?. This will result in only matching one word followed by a dot.

An alternation to r'(.*?\.){2}' will be r'(\w+\.){2}'

How to truncate the beginning of all columns in a pandas data frame?

Answers (1)

Related Questions