Reputation: 599
I have a DataFrame called data with some columns. One of them is Married and another is Gender. Both variables are categorical.
>>> print(data[['Gender', 'Married']].dtypes)
Gender category
Married category
dtype: object
Married contains no NaN values, but Gender contains 12 NaN values, which I want to impute.
>>> print(data['Gender'].isna().sum())
12
A quick analysis showed that if Married='Yes', then Gender='Male' is much more likely. So I want to impute the missing Gender values as follows:
Married='Yes' -> Gender='Male'
Married='No' -> Gender='Female'
So I created a dictionary:
dictionary = {'Yes': 'Male', 'No': 'Female'}
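As a minimal sketch of what this dictionary produces when passed to Series.map, here is a toy example. The Series named married and its values are hypothetical and made up for illustration; only the dictionary comes from the question.

```python
import pandas as pd

# Hypothetical toy data, not the asker's real column.
married = pd.Series(["Yes", "No", "Yes", "No"])
dictionary = {"Yes": "Male", "No": "Female"}

# Series.map looks up each value in the dict;
# values that are not keys of the dict would become NaN.
candidates = married.map(dictionary)
print(candidates.tolist())
```

The mapped Series has one candidate Gender per row, so it can serve as the source of fill values for the rows where Gender is missing.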
Then I wrote some simple code based on fillna():
data['Gender'].fillna(data['Married'].map(dictionary), inplace=True)
And it worked... in a totally different way than expected. It changed the whole Gender column! Every single entry is now based on the Married column. Look at these crosstabs:
Before fillna():
Married No Yes
Gender
Female 80 31
Male 129 352
After fillna():
Married No Yes
Gender
Female 212 0
Male 0 392
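The tables above have the layout that pd.crosstab prints; the question does not show the call, so that is an assumption. The "before" table sums to 592 and the "after" table to 604, and the difference of 12 matches the number of missing Gender values, because rows with a missing Gender are left out of the table by default. A minimal sketch on a hypothetical toy frame (the name toy and its values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five rows have a missing Gender.
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male", np.nan],
    "Married": ["Yes", "Yes", "No", "No", "No"],
})

# Rows whose Gender is NaN are not counted in the table.
table = pd.crosstab(toy["Gender"], toy["Married"])
print(table)
```

Comparing the table total with the number of rows is a quick way to check how many rows are still missing a value.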
What can I do to fill the NaN Gender values based on the Married column?
Upvotes: 3
Views: 4843
Reputation: 88276
You could use np.select, which returns values from a choicelist depending on the results of the conditions:
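import numpy as np

n = df.Gender.isna()  # rows where Gender is missing
m1 = n & (df.Married == 'Yes')
m2 = n & (df.Married == 'No')
# np.select returns a new array, so assign the result back to the column
df['Gender'] = np.select([m1, m2], ['Male', 'Female'], default=df.Gender)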
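A self-contained sketch of the same idea on a hypothetical toy frame may help to check the behaviour. The data below is made up, and Gender is stored as a plain object column rather than a categorical one, which is an assumption made to keep the example simple.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; Gender is an object column here, not categorical.
df = pd.DataFrame({
    "Gender": pd.Series(["Male", np.nan, np.nan, "Female"], dtype="object"),
    "Married": ["No", "Yes", "No", "Yes"],
})

n = df["Gender"].isna()            # rows where Gender is missing
m1 = n & (df["Married"] == "Yes")  # missing and married
m2 = n & (df["Married"] == "No")   # missing and not married

# Rows matching neither condition keep their existing Gender value.
df["Gender"] = np.select([m1, m2], ["Male", "Female"], default=df["Gender"])
print(df["Gender"].tolist())
```

Only the two missing entries change; the existing 'Male' with Married='No' and 'Female' with Married='Yes' are left as they were. Note that np.select returns a NumPy array, so the column would no longer be categorical after the assignment.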
Upvotes: 2
Reputation: 164773
Your code looks fine. If it doesn't work, there may be a pandas bug. You can try loc assignment with Boolean indexing instead:
mask = df['Gender'].isnull()
df.loc[mask, 'Gender'] = df.loc[mask, 'Married'].map(dictionary)
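The loc approach above can be sketched end to end on a hypothetical toy frame. The data is made up, and the columns are plain object columns rather than categorical ones, which is an assumption; the dictionary is the one from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing Gender values.
df = pd.DataFrame({
    "Gender": pd.Series(["Male", np.nan, np.nan, "Female"], dtype="object"),
    "Married": ["No", "Yes", "No", "Yes"],
})
dictionary = {"Yes": "Male", "No": "Female"}

# Select only the rows where Gender is missing ...
mask = df["Gender"].isnull()
# ... and write the mapped Married value into those rows alone.
df.loc[mask, "Gender"] = df.loc[mask, "Married"].map(dictionary)
print(df["Gender"].tolist())
```

Because the assignment is restricted to the masked rows, values that were already present cannot be overwritten, whatever fillna does on a given pandas version.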
Upvotes: 4