Reputation: 825
Suppose we have the following dataset:
| | code | category | energy | sugars | proteins |
|---|---|---|---|---|---|
| 0 | 01 | B | 936 | NaN | 7.8 |
| 1 | 02 | NaN | NaN | 15.0 | NaN |
| 2 | 03 | A | 1569.0 | 23 | 4.1 |
| 3 | 04 | NaN | 826 | NaN | 3 |
| 4 | 05 | B | 1345 | 22 | 5.1 |
| 5 | 06 | A | NaN | 17 | NaN |
| 6 | 10 | C | 826 | NaN | 3 |
| 7 | 11 | C | 1345 | 26 | 5.1 |
| 8 | 101 | B | NaN | 18 | 6.1 |
| 9 | 102 | B | 636 | NaN | 7.8 |
| 10 | 103 | NaN | NaN | 15.0 | NaN |
| 11 | 104 | A | 1569.0 | 23 | 4.1 |
| 12 | 105 | C | 813 | NaN | 3.5 |
I would like to impute the missing values with SimpleImputer, taking the category column into account.
Namely, I would like to fill each missing value with the mean for the product's category.
If the product has no category, I would like to use the mean of the products without a category.
So, to fill in sugars for code 01, I would only consider the sugars values of products in category B:
| | code | category | energy | sugars | proteins |
|---|---|---|---|---|---|
| 0 | 01 | B | 936 | NaN | 7.8 |
| 4 | 05 | B | 1345 | 22 | 5.1 |
| 8 | 101 | B | NaN | 18 | 6.1 |
| 9 | 102 | B | 636 | NaN | 7.8 |
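As a quick check of the arithmetic, here is a small sketch that uses only the code, category and sugars columns from the table above. It shows the value I would expect for code 01; the variable name b_mean is just for illustration.

```python
import pandas as pd

# code, category and sugars copied from the table above (None marks a missing value)
df = pd.DataFrame({
    "code": ["01", "02", "03", "04", "05", "06", "10", "11", "101", "102", "103", "104", "105"],
    "category": ["B", None, "A", None, "B", "A", "C", "C", "B", "B", None, "A", "C"],
    "sugars": [None, 15, 23, None, 22, 17, None, 26, 18, None, 15, 23, None],
})

# mean() skips NaN, so only the observed category-B sugars (22 and 18) are averaged
b_mean = df.loc[df["category"] == "B", "sugars"].mean()
print(b_mean)  # 20.0
```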
I did something similar, shown below, but I need to do it with SimpleImputer.
To clarify, in the code below I filled the NaN values of products without a category with the mean of the whole column.
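for col in df.columns:
    if df[col].dtype == "float64":
        # rows that have a category get the mean of that category
        has_cat = df[col].isna() & df["category"].notna()
        df.loc[has_cat, col] = df["category"].map(df.groupby("category")[col].mean())
        # remaining NaN (no category) get the mean of the column
        df[col] = df[col].fillna(df[col].mean())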
Upvotes: 3
Views: 2009
Reputation: 9247
I'm afraid you cannot do this with SimpleImputer alone (at least as far as I know).
However, you can write a custom imputer class using scikit-learn's very flexible base classes BaseEstimator and TransformerMixin.
A very basic class would look something like the following:
from sklearn.base import BaseEstimator, TransformerMixin

class WithinGroupMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, group_var):
        self.group_var = group_var

    def fit(self, X, y=None):
        # nothing is learned here; the means are computed in transform
        return self

    def transform(self, X):
        # the copy leaves the original dataframe intact
        X_ = X.copy()
        for col in X_.columns:
            if X_[col].dtypes == 'float64':
                # rows that have a group get the mean of that group
                X_.loc[(X_[col].isna()) & X_[self.group_var].notna(), col] = X_[self.group_var].map(X_.groupby(self.group_var)[col].mean())
                # remaining NaN (no group) get the mean of the column
                X_[col] = X_[col].fillna(X_[col].mean())
        return X_
On your sample dataset:
imp = WithinGroupMeanImputer(group_var='category')
imp.fit(df)
imp.transform(df)
code category energy sugars proteins
0 01 B 936.000000 20.000000 7.800000
1 02 None 1127.848485 15.000000 4.881818
2 03 A 1569.000000 23.000000 4.100000
3 04 None 826.000000 20.916667 3.000000
4 05 B 1345.000000 22.000000 5.100000
5 06 A 1569.000000 17.000000 4.100000
6 10 C 826.000000 26.000000 3.000000
7 11 C 1345.000000 26.000000 5.100000
8 101 B 972.333333 18.000000 6.100000
9 102 B 636.000000 20.000000 7.800000
10 103 None 1127.848485 15.000000 4.881818
11 104 A 1569.000000 23.000000 4.100000
12 105 C 813.000000 26.000000 3.500000
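Note that the class above computes the means inside transform, from whatever dataframe it is given. If the imputer should instead learn the means on training data and reuse them on new rows, one possible variation is sketched below. The class name FittedGroupMeanImputer and the new_rows frame are made up for illustration. The fallback here is the mean of the observed training values, which is an assumption of this sketch and differs slightly from the output above, where the column mean is taken after the group fill.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FittedGroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: group means are learned in fit and reused in transform."""

    def __init__(self, group_var):
        self.group_var = group_var

    def fit(self, X, y=None):
        # float columns to impute (the grouping column itself is excluded)
        self.cols_ = [c for c in X.columns
                      if c != self.group_var and pd.api.types.is_float_dtype(X[c])]
        # per-group means; rows with a missing group are ignored by groupby
        self.group_means_ = X.groupby(self.group_var)[self.cols_].mean()
        # fallback: mean of the observed training values in each column
        self.overall_means_ = X[self.cols_].mean()
        return self

    def transform(self, X):
        X_ = X.copy()
        for col in self.cols_:
            # look up the learned mean for each row's group (NaN if unknown or missing)
            by_group = X_[self.group_var].map(self.group_means_[col])
            X_[col] = X_[col].fillna(by_group).fillna(self.overall_means_[col])
        return X_

df = pd.DataFrame({
    'code': ['01', '02', '03', '04', '05', '06', '10', '11', '101', '102', '103', '104', '105'],
    'category': ['B', None, 'A', None, 'B', 'A', 'C', 'C', 'B', 'B', None, 'A', 'C'],
    'energy': [936, None, 1569, 826, 1345, None, 826, 1345, None, 636, None, 1569, 813],
    'sugars': [None, 15, 23, None, 22, 17, None, 26, 18, None, 15, 23, None],
    'proteins': [7.8, None, 4.1, 3, 5.1, None, 3, 5.1, 6.1, 7.8, None, 4.1, 3.5]
})

# two unseen rows (invented for this example): one in category B, one without a category
new_rows = pd.DataFrame({
    'code': ['200', '201'],
    'category': ['B', None],
    'energy': [float('nan'), float('nan')],
    'sugars': [float('nan'), 10.0],
    'proteins': [1.0, float('nan')]
})

imp = FittedGroupMeanImputer(group_var='category').fit(df)
out = imp.transform(new_rows)
print(out)
```

Because the means live on the fitted object, the same values are applied to any later dataframe, which is usually what you want when the imputer is one step in a train/test workflow.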
Original data:
import pandas as pd
df = pd.DataFrame({
    'code': ['01', '02', '03', '04', '05', '06', '10', '11', '101', '102', '103', '104', '105'],
    'category': ['B', None, 'A', None, 'B', 'A', 'C', 'C', 'B', 'B', None, 'A', 'C'],
    'energy': [936, None, 1569, 826, 1345, None, 826, 1345, None, 636, None, 1569, 813],
    'sugars': [None, 15, 23, None, 22, 17, None, 26, 18, None, 15, 23, None],
    'proteins': [7.8, None, 4.1, 3, 5.1, None, 3, 5.1, 6.1, 7.8, None, 4.1, 3.5]
})
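If SimpleImputer itself has to do the work, one option is to fit a separate SimpleImputer on each category and to put the rows without a category under a placeholder key, so that they form their own group as the question asks. This is only a sketch: it assumes no group has a column that is entirely missing (SimpleImputer discards such columns by default, which would break the assignment), and the '__missing__' key and the impute_one_group helper are hypothetical names.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'code': ['01', '02', '03', '04', '05', '06', '10', '11', '101', '102', '103', '104', '105'],
    'category': ['B', None, 'A', None, 'B', 'A', 'C', 'C', 'B', 'B', None, 'A', 'C'],
    'energy': [936, None, 1569, 826, 1345, None, 826, 1345, None, 636, None, 1569, 813],
    'sugars': [None, 15, 23, None, 22, 17, None, 26, 18, None, 15, 23, None],
    'proteins': [7.8, None, 4.1, 3, 5.1, None, 3, 5.1, 6.1, 7.8, None, 4.1, 3.5]
})

num_cols = ['energy', 'sugars', 'proteins']

def impute_one_group(group):
    # fit a fresh mean imputer on this group's rows only
    group = group.copy()
    group[num_cols] = SimpleImputer(strategy='mean').fit_transform(group[num_cols])
    return group

# rows without a category share one placeholder key, so they are imputed together
keys = df['category'].fillna('__missing__')
parts = [impute_one_group(g) for _, g in df.groupby(keys)]

# put the rows back in their original order
result = pd.concat(parts).sort_index()
print(result)
```

Unlike the custom class above, rows without a category are filled here from other rows without a category (for example, energy for code 02 becomes 826, the only observed value in that group) rather than from the column mean.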
Upvotes: 3