ho_howdy
ho_howdy

Reputation: 73

replacing NaN values in dataframe with pandas

I want to create a function that takes a dataframe and replaces NaN with the mode in categorical columns, and replaces NaN in numerical columns with the mean of that column. If there are more than one mode in the categorical columns, then it should use the first mode.

I have managed to do it with following code:

def exercise4(df):
    df1 = df.select_dtypes(np.number)
    df2 = df.select_dtypes(exclude = 'float')
    mode = df2.mode()
    df3 = df1.fillna(df.mean())
    df4 = df2.fillna(mode.iloc[0,:])
    new_df = [df3,df4]
    df5 = pd.concat(new_df,axis=1)
    new_cols = list(df.columns)
    df6 = df5[new_cols]
    return df6

But i am sure there is a far easier method to do this?

Upvotes: 2

Views: 1743

Answers (4)

George Pipis
George Pipis

Reputation: 1822

You can work as follows:

df = df.apply(lambda x: x.fillna(x.mode()[0]) if (x.dtypes==category) else  x.fillna(x.mean()) )

Upvotes: 0

jezrael
jezrael

Reputation: 862631

You can use:

df = pd.DataFrame({
        'A':list('abcdec'),
         'B':[4,5,4,5,5,4],
         'C':[7,8,9,4,2,3],
         'D':[1,3,5,7,1,0],
         'E':list('bbcdeb'),
})
df.iloc[[1,3], [1,2,0,4]] = np.nan

print (df)
     A    B    C  D    E
0    a  4.0  7.0  1    b
1  NaN  NaN  NaN  3  NaN
2    c  4.0  9.0  5    c
3  NaN  NaN  NaN  7  NaN
4    e  5.0  2.0  1    e
5    c  4.0  3.0  0    b

Idea is use DataFrame.select_dtypes for non numeric columns with DataFrame.mode and select first row by DataFrame.iloc for positions, then count means - non numeric are expluded by default, so possible use Series.append for Series with all values for replacement passed to DataFrame.fillna:

modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean()
both = modes.append(means)
print (both)
A          c
E          b
B       4.25
C       5.25
D    2.83333
dtype: object

df.fillna(both, inplace=True)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b

Passed to function with DataFrame.pipe:

def exercise4(df):
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    means = df.mean()
    both = modes.append(means)
    df.fillna(both, inplace=True)
    return df

df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b

Another idea is use DataFrame.apply, but is necessary result_type='expand' parameter with test dtypes by types.is_numeric_dtype:

from pandas.api.types import is_numeric_dtype

f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b

Passed to function:

from pandas.api.types import is_numeric_dtype

def exercise4(df):
    f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
    df.fillna(df.apply(f, result_type='expand'), inplace=True)
    return df

df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)

Upvotes: 3

gosuto
gosuto

Reputation: 5741

Actually you have all the ingredients already there! Some of your steps can be chained though making some others obsolete.

Looking at these two lines for example:

mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])

You could just replace them with df4 = df2.fillna(df2.mode().iloc[0,:]. Then instead of constantly reassigning new (sub)dataframes to variables, altering them and concatenating them you can make these alterations inplace, meaning they are applied directly to the dataframe in question. Lastly exclude='float' might work in your particular (example) case, but what if there are even more datatypes in the dataframe? A string column maybe?

My suggestion:

def mean_mode(df):
    df.select_dtypes(np.number).fillna(df.mean(), inplace=True)
    df.select_dtypes('category').fillna(df.mode()[0], inplace=True)
    return df

Upvotes: 1

Massifox
Massifox

Reputation: 4487

You can use the _get_numeric_data() method to get the numeric columns (and consequently the categorical ones):

numerical_col = df._get_numeric_data().columns

At this point you only need one line of code using an apply function that runs through the columns:

fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)

Upvotes: 1

Related Questions