Reputation: 73
I want to create a function that takes a dataframe and replaces NaN with the mode in categorical columns, and replaces NaN in numerical columns with the mean of that column. If there are more than one mode in the categorical columns, then it should use the first mode.
I have managed to do it with following code:
def exercise4(df):
df1 = df.select_dtypes(np.number)
df2 = df.select_dtypes(exclude = 'float')
mode = df2.mode()
df3 = df1.fillna(df.mean())
df4 = df2.fillna(mode.iloc[0,:])
new_df = [df3,df4]
df5 = pd.concat(new_df,axis=1)
new_cols = list(df.columns)
df6 = df5[new_cols]
return df6
But i am sure there is a far easier method to do this?
Upvotes: 2
Views: 1743
Reputation: 1822
You can work as follows:
df = df.apply(lambda x: x.fillna(x.mode()[0]) if (x.dtypes==category) else x.fillna(x.mean()) )
Upvotes: 0
Reputation: 862631
You can use:
df = pd.DataFrame({
'A':list('abcdec'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':list('bbcdeb'),
})
df.iloc[[1,3], [1,2,0,4]] = np.nan
print (df)
A B C D E
0 a 4.0 7.0 1 b
1 NaN NaN NaN 3 NaN
2 c 4.0 9.0 5 c
3 NaN NaN NaN 7 NaN
4 e 5.0 2.0 1 e
5 c 4.0 3.0 0 b
Idea is use DataFrame.select_dtypes
for non numeric columns with DataFrame.mode
and select first row by DataFrame.iloc
for positions, then count means
- non numeric are expluded by default, so possible use Series.append
for Series with all values for replacement passed to DataFrame.fillna
:
modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean()
both = modes.append(means)
print (both)
A c
E b
B 4.25
C 5.25
D 2.83333
dtype: object
df.fillna(both, inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Passed to function with DataFrame.pipe
:
def exercise4(df):
modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean()
both = modes.append(means)
df.fillna(both, inplace=True)
return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Another idea is use DataFrame.apply
, but is necessary result_type='expand'
parameter with test dtypes by types.is_numeric_dtype
:
from pandas.api.types import is_numeric_dtype
f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Passed to function:
from pandas.api.types import is_numeric_dtype
def exercise4(df):
f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
Upvotes: 3
Reputation: 5741
Actually you have all the ingredients already there! Some of your steps can be chained though making some others obsolete.
Looking at these two lines for example:
mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])
You could just replace them with df4 = df2.fillna(df2.mode().iloc[0,:]
. Then instead of constantly reassigning new (sub)dataframes to variables, altering them and concatenating them you can make these alterations inplace
, meaning they are applied directly to the dataframe in question. Lastly exclude='float'
might work in your particular (example) case, but what if there are even more datatypes in the dataframe? A string column maybe?
My suggestion:
def mean_mode(df):
df.select_dtypes(np.number).fillna(df.mean(), inplace=True)
df.select_dtypes('category').fillna(df.mode()[0], inplace=True)
return df
Upvotes: 1
Reputation: 4487
You can use the _get_numeric_data()
method to get the numeric columns (and consequently the categorical ones):
numerical_col = df._get_numeric_data().columns
At this point you only need one line of code using an apply function that runs through the columns:
fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)
Upvotes: 1