Reputation: 2950
I have a dataset where has column types that has type like Primary
, Secondary
.
df
ID types C D
0 101 Primary 2 3
1 103 Primary 6 3
2 108 Primary 10 ?
3 109 Primary 3 12
4 118 Secondary 5 2
5 122 Secondary ? 6
6 123 Secondary 5 6
7 125 Secondary 2 5
I want to replace missing value with median
for each type. Such as-
result_df
ID types C D
0 101 Primary 2 3
1 103 Primary 6 3
2 108 Primary 10 3
3 109 Primary 3 12
4 118 Secondary 5 2
5 122 Secondary 5 6
6 123 Secondary 5 6
7 125 Secondary 2 5
How can do it with Python?
Upvotes: 0
Views: 699
Reputation: 16966
As mentioned by @Mayank Porwal, first convert the missing values into np.nan and then you can apply imputation with sklearn impute methods.
import numpy as np
import pandas as pd
df.replace('?',np.nan,inplace=True)
from sklearn.impute import SimpleImputer
for types,group in df.groupby('types'):
imp = SimpleImputer(missing_values=np.nan, strategy='median')
df.loc[df['types']==types,['C','D']] = imp.fit_transform(group[['C','D']])
Upvotes: 1
Reputation: 34046
Something like this should work:
First replace ?
in your df with actual np.nan
values:
In [1268]: df = df.replace('?',np.nan)
In [1273]: df
Out[1273]:
ID types C D
0 101 Primary 2 3
1 103 Primary 6 3
2 108 Primary 10 NaN
3 109 Primary 3 12
4 118 Secondary 5 2
5 122 Secondary NaN 6
6 123 Secondary 5 6
7 125 Secondary 2 5
For me the
dtypes
is showing asobject
for columnsC
andD
. Hence, I convert these into numeric before finding median. If this is not applicable for you, skip this step and directly run the below command withtransform
function.
In [1274]: df.dtypes
Out[1274]:
ID int64
types object
C object
D object
dtype: object
In order to find median
, convert columns C
and D
into pandas numeric type:
In [1256]: df.C = df.C.apply(pd.to_numeric)
In [1258]: df.D = df.D.apply(pd.to_numeric)
In [1279]: df.dtypes
Out[1279]:
ID int64
types object
C float64
D float64
dtype: object
Now, you can fill nulls with median
of types in both columns C
and D
like below, using groupby
and transform
functions:
In [1265]: df.C = df.C.fillna(df.groupby('types')['C'].transform('median'))
In [1266]: df.D = df.D.fillna(df.groupby('types')['D'].transform('median'))
In [1267]: df
Out[1267]:
ID types C D
0 101 Primary 2.0 3.0
1 103 Primary 6.0 3.0
2 108 Primary 10.0 3.0
3 109 Primary 3.0 12.0
4 118 Secondary 5.0 2.0
5 122 Secondary 5.0 6.0
6 123 Secondary 5.0 6.0
7 125 Secondary 2.0 5.0
Let me know if this helps.
Upvotes: 2