user3709260

Reputation: 431

Apply function doesn't replace values in dataframe

I'm trying to replace the NaN values in the Age column with the median age of each row's group. I've built a table called grouped_median using groupby. This is my code:

def fillAges(row, grouped_median):
    return grouped_median.loc[row['Sex'], row['Class'], row['Title']]['Age'] 


df['Age'] = df.apply(lambda x : fillAges(x, grouped_median) if np.isnan(x['Age']) else x['Age'], axis=1)

df

If I print only this part:

print(df.apply(lambda x : fillAges(x, grouped_median) if np.isnan(x['Age']) else x['Age'], axis=1))

The printed values are correct, but when I look at df afterwards, the NaN values have not been replaced. I'd appreciate any help. Thank you!

EDIT: As Nathaniel said, the code above works fine. In fact, df is a large dataframe concatenated from the train and test datasets, with one extra flag column that is either "train" or "test". This is what I was actually doing:

df[df['flag']=='train']['Age'] = df[df['flag']=='train'].apply(lambda x : fillAges(x, grouped_median) if np.isnan(x['Age']) else x['Age'], axis=1)

and it didn't work. It gave me the following warning, but I assumed it was only a warning, not a sign that the assignment was doing nothing: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/…"

All I needed to do was remove the [df['flag']=='train'] part from the left-hand side.

I'm still not sure why the original approach didn't work. If anyone has any insight, I'd appreciate hearing it. Thank you.

Upvotes: 1

Views: 3036

Answers (1)

Nathaniel Rivera Saul

Reputation: 647

You'll have to write your function so that it takes a Series (the row) and returns a Series, rather than returning just a single element of it. I've added the function series_op below, which should do this for you.

def fillAges(row, grouped_median):
    return grouped_median.loc[row['Sex'], row['Class'], row['Title']]['Age'] 

def series_op(x):
    x['Age'] = fillAges(x, grouped_median) if np.isnan(x['Age']) else x['Age']
    return x


corrected_df = df.apply(series_op, axis=1)

I don't have your data or grouped_median, so I can't reproduce your problem. With some test data that I cooked up, this version works correctly, but so does your original code.
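
On the follow-up in the EDIT: df[df['flag']=='train'] builds a new object, so assigning to its ['Age'] column writes to that temporary object rather than to df, which is what the warning is pointing at. Below is a minimal sketch of the single-step .loc form the warning suggests. The toy data and the names train_missing and group_median are my own assumptions for illustration, and I've swapped the row-wise apply for groupby(...).transform('median') to keep it short, so treat it as one possible approach rather than the only fix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the combined train/test frame.
df = pd.DataFrame({
    "flag": ["train", "train", "test", "train"],
    "Sex": ["male", "male", "male", "female"],
    "Class": [1, 1, 1, 2],
    "Title": ["Mr", "Mr", "Mr", "Mrs"],
    "Age": [20.0, np.nan, np.nan, 30.0],
})

# Rows that are in the train part AND have a missing Age.
train_missing = (df["flag"] == "train") & df["Age"].isna()

# Median Age of each (Sex, Class, Title) group, aligned to df's index.
# NaN values are skipped when the median is computed.
group_median = df.groupby(["Sex", "Class", "Title"])["Age"].transform("median")

# Chained form (does NOT update df): the boolean selection returns a copy,
# so the assignment lands on that copy and is then thrown away.
# df[train_missing]["Age"] = group_median[train_missing]

# Single-step form: one .loc call picks rows and column on df itself.
df.loc[train_missing, "Age"] = group_median[train_missing]

print(df)
```

Here only the train rows with a missing Age are filled; the test rows are left as they were.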

Upvotes: 2
