Filling in null values in one column based on aggregate of a different column

Question

I'm learning some basic data science and working on the Titanic dataset. The 'Age' column has null values, which I'd like to fill with the average age within each group of a different column, say 'Pclass' or 'Sex'.

'Pclass' refers to Passenger Class and has three values (1, 2, 3), based on whether the passenger had a 1st, 2nd or 3rd class ticket.

I am trying to generalize this process by writing a function that takes in two column names, 'Age' and the column we want to use to aggregate. I can't think of how I can completely generalize this, so for now, let's say I aggregate based on Pclass.

I got the mean age based on Pclass as follows:

# Figure out the mean age for each class
# (select 'Age' before aggregating so non-numeric columns are not averaged)
mean_age = df_train.groupby('Pclass')['Age'].mean().round()
mean_age

I tried to define the function as follows (38, 30 and 25 are from mean_age):

def fill_age(data, col1, col2):
    if data[col1].isnull():
        if data[col2] == 1:
            return 38
        elif data[col2] == 2:
            return 30
        else:
            return 25
    else:
        return data[col1]

And tried to use .apply():

df_train['Age'] = df_train.apply(fill_age(df_train,'Age','Pclass'), axis = 1)

What am I getting wrong here, and how do I think about this to fix it and further generalize it?

Edit: The following line seems to work, but I need the changes applied to the dataframe itself, and I can't use 'inplace' with .apply():

df_train.groupby('Pclass')['Age'].apply(lambda x: x.fillna(round(x.mean())))

cs95 · Accepted Answer

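You shouldn't call the function inside apply. Writing fill_age(df_train, 'Age', 'Pclass') runs it once on the whole dataframe (where `if data[col1].isnull():` raises because the truth value of a Series is ambiguous) instead of handing apply a function to run on each row. Pass the function itself and supply its arguments via args=() or as keyword arguments: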

df['Age'] = df.apply(fill_age, col1='Age', col2='Pclass', axis=1)
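One caveat with the row-wise route: with axis=1 each row is passed in as a Series, so data[col1] is a single value, and a plain float has no .isnull() method. pd.isnull() handles that case. Here is a minimal sketch of a row-wise version, assuming a small made-up frame in place of the real training data. The helper name fill_from_group_means and its means parameter are hypothetical. It looks the fill value up in the per-group means rather than hard-coding 38/30/25:

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the Titanic training set.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Age": [40.0, np.nan, 30.0, np.nan, 20.0, np.nan],
})

# Mean age per class, rounded as in the question.
mean_age = df.groupby("Pclass")["Age"].mean().round()

def fill_from_group_means(row, col1, col2, means):
    """Hypothetical helper: return row[col1], or the group mean if it is missing."""
    if pd.isnull(row[col1]):          # scalar-safe null check
        return means[row[col2]]       # look up the mean for this row's group
    return row[col1]

# Pass the function itself; extra keyword arguments are forwarded to it.
df["Age"] = df.apply(
    fill_from_group_means, axis=1, col1="Age", col2="Pclass", means=mean_age
)
```

Because the means are passed in, the same helper should work for another grouping column such as 'Sex', provided means is computed with that column.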

But there's a better way to do this, via vectorization:

df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))
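To generalize this, the vectorized line can be wrapped in a small function that takes the column names. The following is only a sketch under assumptions: fill_by_group and its parameters are hypothetical names, and the sample frame is made up. Assigning the result back to the column is what writes the change into the dataframe, so no 'inplace' option is needed:

```python
import numpy as np
import pandas as pd

def fill_by_group(df, target, by, agg="mean"):
    """Hypothetical helper: fill NaNs in `target` with a per-group aggregate.

    `by` is the column (or list of columns) to group on; `agg` is any
    aggregation name accepted by groupby().transform(), e.g. "mean" or "median".
    """
    # transform() returns a Series aligned with df's index, one value per row.
    group_values = df.groupby(by)[target].transform(agg)
    return df[target].fillna(group_values)

# Made-up sample data standing in for the Titanic training set.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Age": [40.0, np.nan, 30.0, np.nan, 20.0, np.nan],
})

# Assign back to the column to update the dataframe itself.
df["Age"] = fill_by_group(df, "Age", "Pclass")
```

To match the rounding in the question, round the fill values before filling, for example by calling .round() on group_values. Note that a group with no observed values in the target column has a NaN aggregate, so its rows stay NaN.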
