Anish

Reputation: 31

fillna() not allowing floating values

I'm testing a simple imputation method on the side using a copy of my dataset. I'm essentially trying to impute missing values with categorical means grouped by the target variable.

df_test_2 = train_df.loc[:,['Survived','Age']].copy()  #copy of dataset for testing

#creating impute function
def impute(df,variable):
    if 'Survived'==0: df[variable] = df[variable].fillna(30.7)
    else: df[variable] = df[variable].fillna(28.3)

#imputing
impute(df_test_2,'Age')

The output is that the imputation is successful, but the values added are 30 and 28 instead of 30.7 and 28.3. 'Age' is float64.

Thank you

Edit: I simply copied the old code for calling the function here and corrected it now. Wasn't the issue in my original code; problem persists.

Upvotes: 0

Views: 205

Answers (2)

Paul Brennan

Reputation: 2696

Have a look at this to see what may be going on

To test it I set up a simple case

import pandas as pd
import numpy as np

data = {'Survived' : [0,1,1,0,0,1], 'Age' :[12.2,45.4,np.nan,np.nan,64.3,44.3]}
df = pd.DataFrame(data)
df

This gives the data set

    Survived    Age
0   0           12.2
1   1           45.4
2   1           NaN
3   0           NaN
4   0           64.3
5   1           44.3

I ran your function exactly as posted

def impute(df,variable):
    if 'Survived'==0: df[variable] = df[variable].fillna(30.7)
    else: df[variable] = df[variable].fillna(28.3)

and this yielded this result

    Survived    Age
0   0           12.2
1   1           45.4
2   1           28.3
3   0           28.3
4   0           64.3
5   1           44.3

As you can see, the row at index 3 (where Survived is 0) got filled with the wrong value: 28.3 instead of 30.7. The problem is the test 'Survived'==0. It compares the string 'Survived' itself with 0, not the values in the column, so it is always False and the else branch runs every time.
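To make the difference concrete, here is a small sketch (the names literal_check and column_check are only for illustration) comparing the string literal with 0 versus comparing the column with 0:

```python
import pandas as pd

df = pd.DataFrame({"Survived": [0, 1, 1, 0]})

# Comparing the string literal with an integer: a single, constant bool.
literal_check = ("Survived" == 0)

# Comparing the column with an integer: one bool per row.
column_check = df["Survived"] == 0

print(literal_check)          # a plain Python bool
print(column_check.tolist())  # a row-by-row mask
```

Only the second form looks at the data, so only it can be used to pick rows.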

What you may want is

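# Fill each group separately, then stack the two pieces back together.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
df2 = df[df['Survived'] == 0].fillna(30.7)
df3 = df[df['Survived'] == 1].fillna(28.3)
dfout = pd.concat([df2, df3])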

and the output is

    Survived    Age
0   0           12.2
3   0           30.7
4   0           64.3
1   1           45.4
2   1           28.3
5   1           44.3
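Note that splitting and recombining reorders the rows. If the original order matters, one possible alternative (a sketch, not the only way; the names fill_by_group and fill_values are made up here) is to map each row's Survived value to its fill value and pass that Series to fillna, which aligns on the index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Age": [12.2, 45.4, np.nan, np.nan, 64.3, 44.3],
})

# Hypothetical lookup: the fill value to use for each Survived group.
fill_by_group = {0: 30.7, 1: 28.3}

# One fill value per row, chosen from that row's Survived value.
fill_values = df["Survived"].map(fill_by_group)

# fillna with a Series only touches the missing entries and keeps row order.
df["Age"] = df["Age"].fillna(fill_values)
print(df)
```

This only fills Age, whereas calling fillna on the whole sub-frame would fill missing values in every column.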

Upvotes: 2

sebashc3712

Reputation: 65

Anish

I think it is better to use the apply() method available in pandas. This method applies a custom function over a dataframe, either row by row or column by column.

Here is a related post: Stack Question

pandas documentation: Doc Apply df
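As a rough sketch of what that could look like for this question (the helper name fill_age is made up, and the fill values 30.7 and 28.3 are taken from the question), apply() with axis=1 calls the function once per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Age": [12.2, 45.4, np.nan, np.nan, 64.3, 44.3],
})

def fill_age(row):
    """Return the row's Age, or a group-specific value if Age is missing."""
    if pd.isna(row["Age"]):
        return 30.7 if row["Survived"] == 0 else 28.3
    return row["Age"]

# axis=1 passes each row to fill_age as a Series.
df["Age"] = df.apply(fill_age, axis=1)
print(df)
```

Row-wise apply() is easy to read but tends to be slow on large frames, so a vectorised approach may be preferable there.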

regards,

Upvotes: 0
