Replacing missing data in pandas.DataFrame not working

Question

I'm working through Kaggle's Titanic exercise.

I have a pandas.DataFrame whose 'Age' column has some NaN values scattered through it, and another column, IsAlone, that I created myself. Its values are 1 or 0 depending on whether the person was alone on the ship, based on a personal rule.

I'm trying to replace the NaN values in the Age column for people who were alone with the mean age of those who were alone, and likewise for those who weren't alone. The purpose is just to practise with pandas DataFrames by replacing NaN values according to a rule.

I'm doing this to those who were alone:

df_train[(df_train.IsAlone.astype(bool) & df_train.Age.isnull() )].Age = \
    df_train[(df_train.IsAlone.astype(bool) & ~df_train.Age.isnull() )].Age.mean()

And the same way to those who weren't alone:

df_train[(~df_train.IsAlone.astype(bool) & df_train.Age.isnull() )].Age = \
    df_train[(~df_train.IsAlone.astype(bool) & ~df_train.Age.isnull() )].Age.mean()

But this is not working at all; the Age column still has the same NaN values.

Any thoughts on this?

behzad.nouri · Accepted Answer

The problem is that the values are changed on a copy of the original frame, so the original is never modified. Refer to "Returning a view versus a copy" in the pandas documentation for details. As the documentation puts it:

When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
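
To make the copy problem concrete, here is a minimal sketch on a hypothetical four-row frame (the name df and its values are made up for illustration and are not the Titanic data). It suggests that the chained form leaves the frame untouched, while a single .loc call changes it. pandas may emit a warning on the chained line, which is silenced here only to keep the output clean.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (assumed values).
df = pd.DataFrame({
    "IsAlone": [1, 1, 0, 0],
    "Age": [20.0, np.nan, 40.0, np.nan],
})

# Rows for passengers who were alone and have no Age recorded.
mask = df.IsAlone.astype(bool) & df.Age.isnull()

# Chained form: df[mask] builds a new frame, and the assignment lands
# on that temporary object rather than on df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask].Age = 99.0
chained_nan_count = int(df.Age.isnull().sum())

# Single .loc call: rows and column are selected together, so the
# assignment is applied to df itself.
df.loc[mask, "Age"] = 99.0
loc_nan_count = int(df.Age.isnull().sum())

print(chained_nan_count, loc_nan_count)
```

The first count stays at the original number of missing values; the second drops by one because only the masked row was filled.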

To change the values in the original frame itself, select the rows and the column in a single .loc call:

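# j: passengers who were alone and have a missing Age
j = df_train.IsAlone.astype(bool) & df_train.Age.isnull()
# i: passengers who were alone and have a known Age
i = df_train.IsAlone.astype(bool) & ~df_train.Age.isnull()
# Fill the missing ages with the mean of the known ages in the same group
df_train.loc[j, 'Age'] = df_train.loc[i, 'Age'].mean()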
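
The snippet above covers only the passengers who were alone; the other group needs the same steps with the IsAlone mask negated. As an alternative that is not part of the original answer, both groups can probably be handled in one statement with groupby and transform. The sketch below uses a hypothetical six-row frame (df and its values are assumptions, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (assumed values).
df = pd.DataFrame({
    "IsAlone": [1, 1, 1, 0, 0, 0],
    "Age": [20.0, 30.0, np.nan, 40.0, 60.0, np.nan],
})

# Per-row mean Age of the row's own IsAlone group; NaN ages are
# skipped when the mean is computed.
group_mean = df.groupby("IsAlone")["Age"].transform("mean")

# Fill only the missing ages, each with the mean of its own group,
# and assign the result back to the column of the original frame.
df["Age"] = df["Age"].fillna(group_mean)

print(df)
```

Because the result is assigned back to df["Age"] directly, no chained indexing is involved, and the approach extends to more than two groups without extra masks.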
