Reputation: 37
I'm learning some basic data science, and I am working on the Titanic dataset. The 'Age' column has null values, which I'd like to fill with the mean age of each group defined by a different column, say 'Pclass' or 'Sex'.
'Pclass' refers to Passenger Class and has three values (1, 2, 3) based on whether the passenger had a 1st, 2nd or 3rd class ticket.
I am trying to generalize this process by writing a function that takes in two column names, 'Age' and the column we want to use to aggregate. I can't think of how I can completely generalize this, so for now, let's say I aggregate based on Pclass.
I got the mean age based on Pclass as follows:
# Figure out the mean age for each class
mean_age = round(df_train.groupby('Pclass')['Age'].mean())
mean_age
I tried to define the function as follows (38, 30 and 25 are from mean_age):
def fill_age(data, col1, col2):
    if data[col1].isnull():
        if data[col2] == 1:
            return 38
        elif data[col2] == 2:
            return 30
        else:
            return 25
    else:
        return data[col1]
And tried to use .apply():
df_train['Age'] = df_train.apply(fill_age(df_train,'Age','Pclass'), axis = 1)
What am I getting wrong here, and how do I think about this to fix it and further generalize it?
Edit: The following line seems to have worked, but I need it to apply the changes to the dataframe itself, and I can't use 'inplace' with .apply()
df_train.groupby('Pclass')['Age'].apply(lambda x: x.fillna(round(x.mean())))
Upvotes: 1
Views: 527
Reputation: 403120
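You shouldn't call the function inside apply. Writing fill_age(df_train, 'Age', 'Pclass') runs the function once on the whole DataFrame and hands its result to apply, rather than handing over the function. Instead, pass the function itself and supply the arguments via args=() or keyword arguments: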
df['Age'] = df.apply(fill_age, col1='Age', col2='Pclass', axis=1)
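Note that with axis=1 each call receives a single row, so data[col1] is a scalar and has no .isnull() method; the null check needs pd.isna (or pd.isnull) instead. A minimal sketch of the row-wise version, assuming a small made-up frame in place of the Titanic data and keeping the hard-coded means from the question:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration.
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age': [40.0, np.nan, 30.0, np.nan, 20.0, np.nan],
})

def fill_age(row, col1, col2):
    # With axis=1, `row` is a Series holding one row, so row[col1] is a scalar.
    if pd.isna(row[col1]):
        # Hard-coded class means taken from the question (38, 30, 25).
        if row[col2] == 1:
            return 38
        elif row[col2] == 2:
            return 30
        else:
            return 25
    return row[col1]

# Pass the function object; col1 and col2 are forwarded as keyword arguments.
df['Age'] = df.apply(fill_age, col1='Age', col2='Pclass', axis=1)
```

Assigning the result back to df['Age'] is what updates the DataFrame; apply itself has no inplace option.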
But there's a better way to do this, via vectorization:
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))
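To generalize this to any pair of columns, the same transform-then-fillna idea can be wrapped in a function. This is only a sketch: the name fill_with_group_mean and its digits parameter are hypothetical, the sample frame is made up, and the rounding mirrors the round() call in the question rather than anything the one-liner above does.

```python
import numpy as np
import pandas as pd

def fill_with_group_mean(data, target, by, digits=0):
    """Return data[target] with NaNs replaced by the rounded mean of each `by` group.

    Hypothetical helper; `digits` controls the rounding of the group means.
    """
    # transform('mean') returns a Series aligned with data's index,
    # holding each row's group mean.
    group_mean = data.groupby(by)[target].transform('mean').round(digits)
    # fillna with an aligned Series only touches the missing entries.
    return data[target].fillna(group_mean)

# Toy stand-in for df_train; the values are made up for illustration.
df_train = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'Age': [38.0, 40.0, np.nan, 30.0, np.nan, 26.0],
})

by_class = fill_with_group_mean(df_train, 'Age', 'Pclass')
by_sex = fill_with_group_mean(df_train, 'Age', 'Sex')

# Assign back to change the DataFrame itself (no inplace needed).
df_train['Age'] = by_class
```

The function returns a new Series instead of mutating its input, so the caller decides whether to overwrite the column or keep the original alongside it.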
Upvotes: 2