user23237706
user23237706

Reputation: 3

Python Pandas df Rename columns based on condition

I have a data frame that contains four columns that do not have a label (so they are df[0],df[1], df[2], and df[3] based on position).

I need to interpret what the column names are based on the data. For example purposes, let's say the columns needed are 'First Name', 'Last Name', 'Age', and 'Gender'. Currently I assume that this is the order of the data, but I need to be able to accept the data in any order (based on varied user input). The data is not labeled in a way that would be useful for me to say column 3 is age or column 1 is gender, because there is no header.

I know, based on an allowed set of values, what each column can contain. Age can be from 0-99. Gender can be Male, Female, or Other. The first and last names are from known lists. Is there a way I can set the name of the columns based on this information? I know I can count the number of values in each column that are within the expected value for age, gender, name, and determine that column 1 in the data frame has 100% values expected for age, but I cannot figure out how to rename the column based on this information.

Here is an example of the data:

0 1 2 3
Tom Male 25 Smith
John Male 18 Doe
Lucy Female 7 Black
Jane Female 48 Doe

Upvotes: 0

Views: 82

Answers (2)

mozway
mozway

Reputation: 262234

Since you have criteria to determine the likelihood of a column to be a specific header, you could apply tests on each one and select the one with highest probability using sum+idxmax, then rename:

first_names = ['Eve', 'Jane', 'John', 'Lucy', 'Tom']
last_names = ['Black', 'Doe', 'White']

# Age is a numeric column with values between 0-100
col_age = (df.select_dtypes('number')
             .apply(lambda x: x.between(0, 100)).sum().idxmax()
           )

# Gender contains Male/Female/Other
col_gender = df.isin(['Male', 'Female', 'Other']).sum().idxmax()

# First/Last names contain values from list of known first/last names
col_first = df.isin(first_names).sum().idxmax()
col_last = df.isin(last_names).sum().idxmax()

out = df.rename(columns={col_first: 'First Name',
                         col_last: 'Last Name',
                         col_age: 'Age',
                         col_gender: 'Gender',
                        })

Output:

  First Name  Gender  Age Last Name
0        Tom    Male   25     Smith
1       John    Male   18       Doe
2       Lucy  Female    7     Black
3       Jane  Female   48       Doe

Intermediates for the Last Name check:

# df.isin(last_names)
       0      1      2      3
0  False  False  False  False
1  False  False  False   True
2  False  False  False   True
3  False  False  False   True

# df.isin(last_names).sum()
0    0
1    0
2    0
3    3
dtype: int64

# df.isin(last_names).sum().idxmax()
3
generalization

You could generalize by defining a dictionary of column name: (dtype, function) to automate the tests and pairing with the maximum likelihood column:

cols = {'Fist Name': (None, lambda x: x.isin(first_names)),
        'Last Name': (None, lambda x: x.isin(last_names)),
        'Age': ('number', lambda x: x.between(0, 100)),
        'Gender': (None, lambda x: x.isin(['Male', 'Female', 'Other'])),
        }

out = df.rename(columns={df.pipe(lambda x: x.select_dtypes(dtype) if dtype else x)
                           .apply(f).sum().idxmax(): c
                         for c, (dtype, f) in cols.items()})

Note that a column will necessarily be found even if there is no match, in case of equal probability the first available column will be used. You might want to add additional tests if the selection criteria could be ambiguous.

Upvotes: 0

rehaqds
rehaqds

Reputation: 2070

To update all the columns names (from the 1st to last column):

df.columns = ['First Name', 'Last Name', 'Age', 'Gender']

To update the name of one or more columns:

df = df.rename(columns={1:"Last Name"})

Upvotes: 0

Related Questions