Reputation: 119
I have a pandas data frame with quite a few different columns (4800+) and they are all named after file directories. For example, I can have two columns named:
Return.ReturnData.IRS990.EOYAmount
and Return.ReturnData.IRS990PF.EOYAmount
as you can see the only difference is at IRS990/IRS990PF
. But these two columns contain similar data so I want tor rename them both to EOYAmount
.
Obviously I have a couple thousand files so I can't do this for every single one but I wanted to know how to truncate everything before the last period? If anybody has any suggestions please let me know. Thanks!
Upvotes: 0
Views: 98
Reputation: 758
You could use the following:
import re
rename_pattern = r'.*\.' # will remove everything until
# the last . in name (greedy regex, longest match)
# applies the lambda mapper
# function to all columns and replaces the rename_pattern with an empty string
df = df.rename(columns=lambda col_name: re.sub(rename_pattern,'',col_name))
Edit:
To keep only a part of the column names you can alter the above given code as follows:
import re
rename_pattern = r'(.*?\.){2}'
df = df.rename(columns=lambda col_name: re.sub(rename_pattern,'',col_name))
The ( )
in the rename_pattern
groups the regular expression and allows to specify the occurence quantity with the trailing {2}
.
If your path is longer and you have to delete a longer pattern you can alter this 2
with any number you like.
Another change is to make .*
matching non-greedy with ?
.
This will result in only matching one word followed by a dot.
An alternation to r'(.*?\.){2}'
will be r'(\w+\.){2}'
Upvotes: 1