Reputation: 91
So I have an IF statement in python which essentially looks to change null values in a dataset to an average based off two other columns.
def impute_age(cols):
Age = cols[0]
Pclass = cols[1]
Sex = cols[2]
if pd.isnull(Age):
if Pclass == 1 and Sex == 0:
return train.loc[(train["Pclass"] == 1)
& (train["Sex_male"] == 0)]["Age"].mean()
if Pclass == 2 and Sex == 0:
return train.loc[(train["Pclass"] == 2)
& (train["Sex_male"] == 0)]["Age"].mean()
if Pclass == 3 and Sex == 0:
return train.loc[(train["Pclass"] == 3)
& (train["Sex_male"] == 0)]["Age"].mean()
if Pclass == 1 and Sex == 1:
return train.loc[(train["Pclass"] == 1)
& (train["Sex_male"] == 1)]["Age"].mean()
if Pclass == 2 and Sex == 1:
return train.loc[(train["Pclass"] == 2)
& (train["Sex_male"] == 1)]["Age"].mean()
if Pclass == 3 and Sex == 1:
return train.loc[(train["Pclass"] == 3)
& (train["Sex_male"] == 1)]["Age"].mean()
else:
return Age
So here i'm trying to fill in nans using the average age of male/females in certain passenger classes. I feel like there would be a much better way of writing this, especially if I was to come across a much bigger dataset.
For reference the train
df is the main df with all of the data. For some reason I couldn't get this code to work with a subset of train passed through using the cols
argument.
The question here is essentially: how can I write this in a much simpler way & is there a way I could write this IF statement if my dataset was MUCH larger?
Upvotes: 1
Views: 148
Reputation: 4635
PCLASS_VALUES = [
[],
]
SEX_VALUES = [
[],
]
return train.loc[(train["Pclass"] == PCLASS_VALUES[Pclass][Sex]) & (train["Sex_male"] == SEX_VALUES[Pclass][Sex])]["Age"].mean()
Upvotes: 0
Reputation: 77827
It appears to me that all you need to do is parameterize your inner if
:
if pd.isnull(Age):
return train.loc[(train["Pclass"] == Pclass)
& (train["Sex_male"] == Sex)]["Age"].mean()
Upvotes: 9