Reputation: 398
I have a data set that has columns for number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.
So for instance, my data looks like:
ProductID | CustomerID | 04/2018 | 03/2018 | 02/2018 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.
What I would like to do, is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*)
that will result in the output:
ProductID | CustomerID | Month1 | Month2 | Month3 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer) I honestly have no idea how to achieve this.
I have done a bit of research on renaming columns in Python, but none of them mentioned pattern matching to rename a column, eg:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
UPDATE: The data that I am showing in my example is only a subset of the columns; total, in my data set I have 120 columns, only 12 of which need to be renamed, this is why I thought that regex might be the simplest way to go.
Upvotes: 0
Views: 2050
Reputation: 383
Since rename
could take a function as a mapper, we could define a customized function which returns a new column name in the new format if the old column name matches regex; otherwise, returns the same column name. For example,
import re
def mapper(old_name):
match = re.match(r'([0-9]*)/([0-9]*)', old_name)
if match:
return 'Month{}'.format(int(match.group(1)))
return old_name
df = df.rename(columns=mapper)
Upvotes: 1
Reputation: 338
import re
# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")
# get headers as list
headers = list(df)
# apply regex
months = 1
for index, header in enumerate(headers):
if pattern.match(header):
headers[index] = 'Month{}'.format(months)
months += 1
# set new list as column headers
df.columns = headers
Upvotes: 1
Reputation: 3591
If you have some set names that you want to convert to, then rather than using rename, it might easier to just pass a new list to the df.columns attribute
df.columns = ['ProductID','CustomerID']+['Month{}'.format(i) for i in range(12)]+['FileDate']
If you want to use rename, if you can write a function find_new_name
that does the conversion you want for a single name, you can rename an entire list old_names
with
df.rename(columns = {oldname:find_new_name(old_name) for old_name in old_names})
Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be
df.rename(columns = {find_old_name(new_name):new_name for new_name in new_names})
You can also do
for new_name in new_names:
old_name = find_new_name(old_name)
df[new_name] = df[old_name]
This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.
Upvotes: 1