Reputation: 31
I'm testing a simple imputation method on the side using a copy of my dataset. I'm essentially trying to impute missing values with categorical means grouped by the target variable.
df_test_2 = train_df.loc[:,['Survived','Age']].copy() #copy of dataset for testing
#creating impute function
def impute(df,variable):
    if 'Survived'==0: df[variable] = df[variable].fillna(30.7)
    else: df[variable] = df[variable].fillna(28.3)
#imputing
impute(df_test_2,'Age')
The output is that the imputation is successful, but the values added are 30 and 28 instead of 30.7 and 28.3. 'Age' is float64.
Thank you
Edit: I had pasted an outdated version of the function call here and have now corrected it. That was not the issue in my original code; the problem persists.
Upvotes: 0
Views: 205
Reputation: 2696
Have a look at this to see what may be going on.
To test it, I set up a simple case:
import pandas as pd
import numpy as np
data = {'Survived' : [0,1,1,0,0,1], 'Age' :[12.2,45.4,np.nan,np.nan,64.3,44.3]}
df = pd.DataFrame(data)
df
This gives the following data set:
   Survived   Age
0         0  12.2
1         1  45.4
2         1   NaN
3         0   NaN
4         0  64.3
5         1  44.3
I ran your function exactly as posted:
def impute(df,variable):
    if 'Survived'==0: df[variable] = df[variable].fillna(30.7)
    else: df[variable] = df[variable].fillna(28.3)
and this yielded the following result:
   Survived   Age
0         0  12.2
1         1  45.4
2         1  28.3
3         0  28.3
4         0  64.3
5         1  44.3
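As you can see, the Age in the row at index 3 was filled with the wrong value (28.3 instead of 30.7). The problem is the condition 'Survived'==0, which is always False: it compares the string 'Survived' to the integer 0 instead of checking the values in the Survived column, so the else branch always runs.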
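To make the difference concrete, here is a minimal sketch (the small `df` and the variable names are made up for illustration) comparing the string comparison from the function with a comparison against the actual column:

```python
import pandas as pd

# Hypothetical toy frame, only for demonstrating the two comparisons.
df = pd.DataFrame({'Survived': [0, 1, 1, 0]})

# Compares the string 'Survived' with the integer 0: a single, constant False.
constant_check = ('Survived' == 0)

# Compares every value in the column with 0: a boolean Series, one entry per row.
column_check = df['Survived'] == 0

print(constant_check)
print(column_check.tolist())
```

Only the second form looks at the data, and it produces one True/False per row, which is why it has to be used as a mask rather than inside a plain `if`.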
What you may want is
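df2 = df[df['Survived'] == 0].fillna(30.7)  # rows with Survived == 0
df3 = df[df['Survived'] == 1].fillna(28.3)  # rows with Survived == 1
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
dfout = pd.concat([df2, df3])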
and the output is
   Survived   Age
0         0  12.2
3         0  30.7
4         0  64.3
1         1  45.4
2         1  28.3
5         1  44.3
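If you would rather keep the original row order, one possible alternative (a sketch, not the only way to do it) is to build a per-row fill value and pass it to fillna, which aligns on the index. The names `fill_by_class`, `class_means`, `fixed` and `computed` are just illustrative; 30.7 and 28.3 are the values from the question. The second variant computes the class means from the data instead of hard-coding them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1],
    'Age': [12.2, 45.4, np.nan, np.nan, 64.3, 44.3],
})

# Variant 1: fixed fill value per class (numbers taken from the question).
fill_by_class = df['Survived'].map({0: 30.7, 1: 28.3})
fixed = df.assign(Age=df['Age'].fillna(fill_by_class))

# Variant 2: use the mean of the observed ages within each class.
# transform('mean') returns a Series aligned with df, ignoring NaN values.
class_means = df.groupby('Survived')['Age'].transform('mean')
computed = df.assign(Age=df['Age'].fillna(class_means))

print(fixed)
print(computed)
```

Both variants only touch the missing entries and leave the index as it was, so there is no need to split and re-join the frame.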
Upvotes: 2
Reputation: 65
Anish
I think it is better to use the apply() method available in pandas. This method applies a custom function along the rows or the columns of a dataframe.
Here is a related post: Stack Question
pandas documentation: Doc Apply df
Regards,
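As a rough sketch of what that could look like for this question (the helper `fill_age` is a hypothetical name, and the sample data is made up), a row-wise apply with axis=1 might be:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1],
    'Age': [12.2, 45.4, np.nan, np.nan, 64.3, 44.3],
})

def fill_age(row):
    """Return the row's Age, or a class-specific default if it is missing."""
    if pd.isna(row['Age']):
        # 30.7 and 28.3 are the fill values used in the question.
        return 30.7 if row['Survived'] == 0 else 28.3
    return row['Age']

# axis=1 passes each row to fill_age as a Series.
df['Age'] = df.apply(fill_age, axis=1)
print(df)
```

Keep in mind that a row-wise apply calls the Python function once per row, so on large frames a vectorised approach is usually faster.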
Upvotes: 0