Reputation: 3
I have a data frame that contains four columns that do not have a label (so they are df[0],df[1], df[2], and df[3] based on position).
I need to interpret what the column names are based on the data. For example purposes, let's say the columns needed are 'First Name', 'Last Name', 'Age', and 'Gender'. Currently I assume that this is the order of the data, but I need to be able to accept the data in any order (based on varied user input). The data is not labeled in a way that would be useful for me to say column 3 is age or column 1 is gender, because there is no header.
I know, based on an allowed set of values, what each column can contain. Age can be from 0-99. Gender can be Male, Female, or Other. The first and last names are from known lists. Is there a way I can set the name of the columns based on this information? I know I can count the number of values in each column that are within the expected value for age, gender, name, and determine that column 1 in the data frame has 100% values expected for age, but I cannot figure out how to rename the column based on this information.
Here is an example of the data:
0 | 1 | 2 | 3 |
---|---|---|---|
Tom | Male | 25 | Smith |
John | Male | 18 | Doe |
Lucy | Female | 7 | Black |
Jane | Female | 48 | Doe |
Upvotes: 0
Views: 82
Reputation: 262234
Since you have criteria to determine the likelihood of a column to be a specific header, you could apply tests on each one and select the one with highest probability using sum
+idxmax
, then rename
:
first_names = ['Eve', 'Jane', 'John', 'Lucy', 'Tom']
last_names = ['Black', 'Doe', 'White']
# Age is a numeric column with values between 0-100
col_age = (df.select_dtypes('number')
.apply(lambda x: x.between(0, 100)).sum().idxmax()
)
# Gender contains Male/Female/Other
col_gender = df.isin(['Male', 'Female', 'Other']).sum().idxmax()
# First/Last names contain values from list of known first/last names
col_first = df.isin(first_names).sum().idxmax()
col_last = df.isin(last_names).sum().idxmax()
out = df.rename(columns={col_first: 'First Name',
col_last: 'Last Name',
col_age: 'Age',
col_gender: 'Gender',
})
Output:
First Name Gender Age Last Name
0 Tom Male 25 Smith
1 John Male 18 Doe
2 Lucy Female 7 Black
3 Jane Female 48 Doe
Intermediates for the Last Name check:
# df.isin(last_names)
0 1 2 3
0 False False False False
1 False False False True
2 False False False True
3 False False False True
# df.isin(last_names).sum()
0 0
1 0
2 0
3 3
dtype: int64
# df.isin(last_names).sum().idxmax()
3
You could generalize by defining a dictionary of column name: (dtype, function) to automate the tests and pairing with the maximum likelihood column:
cols = {'Fist Name': (None, lambda x: x.isin(first_names)),
'Last Name': (None, lambda x: x.isin(last_names)),
'Age': ('number', lambda x: x.between(0, 100)),
'Gender': (None, lambda x: x.isin(['Male', 'Female', 'Other'])),
}
out = df.rename(columns={df.pipe(lambda x: x.select_dtypes(dtype) if dtype else x)
.apply(f).sum().idxmax(): c
for c, (dtype, f) in cols.items()})
Note that a column will necessarily be found even if there is no match, in case of equal probability the first available column will be used. You might want to add additional tests if the selection criteria could be ambiguous.
Upvotes: 0
Reputation: 2070
To update all the columns names (from the 1st to last column):
df.columns = ['First Name', 'Last Name', 'Age', 'Gender']
To update the name of one or more columns:
df = df.rename(columns={1:"Last Name"})
Upvotes: 0