Irving Pérez

Reputation: 183

Using SimpleImputer to impute values by class

I'm trying to build a custom transformer wrapped around SimpleImputer. My idea is to apply SimpleImputer separately within each group defined by a categorical column of my choice. I want it to be a sklearn transformer so it can be used in a pipeline. For example, given:

Letter Value
A 10
A 20
B np.nan
B 1
A np.nan
B 2

After applying CustomImputer(column= "Letter", strategy= "mean")

Letter Value
A 10
A 20
B 1.5
B 1
A 15
B 2

Here's my current draft:

class ConditionalImputer(BaseEstimator, TransformerMixin):
    def __init__(self, categoria, strat): # no *args or **kargs
        self.categoria = categoria
        self.strat = strat
        
    def fit(self, X, y=None):
        self.names = X[self.categoria].unique()
        
        return self # nothing else to do
    
    def transform(self, X, y=None):
        
        X_new = pd.DataFrame()
        X_copy = X
        X = X.drop(self.categoria, axis= 1)
        
        imputer = SimpleImputer(strategy= self.strat)
        
        for cat in self.names:
            subset = X[X_copy[self.categoria] == cat]
            
            X_subset = imputer.fit_transform(subset)
            X_subset = pd.DataFrame(X_subset, columns = X.columns)
            
            X_new = pd.concat([X_new, X_subset])
            
        return X_new

It is supposed to take a numeric dataframe with one category column, which gets removed during the transformation, and return the desired dataframe. When I call the fit method it seems to work fine, but when I try to call transform it gives me an error:

Traceback (most recent call last):

  File "C:\Users\Irving\AppData\Local\Temp\ipykernel_11560\3888183145.py", line 1, in <cell line: 1>
    con_test.transform(X_train[num])

  File "C:\Users\Irving\AppData\Local\Temp\ipykernel_11560\3089403585.py", line 20, in transform
    X_subset = imputer.fit_transform(subset)

  File "C:\Users\Irving\PyCharm Projects\Kitten\venv\lib\site-packages\sklearn\base.py", line 867, in fit_transform
    return self.fit(X, **fit_params).transform(X)

  File "C:\Users\Irving\PyCharm Projects\Kitten\venv\lib\site-packages\sklearn\impute\_base.py", line 364, in fit
    X = self._validate_input(X, in_fit=True)

  File "C:\Users\Irving\PyCharm Projects\Kitten\venv\lib\site-packages\sklearn\impute\_base.py", line 319, in _validate_input
    raise ve

  File "C:\Users\Irving\PyCharm Projects\Kitten\venv\lib\site-packages\sklearn\impute\_base.py", line 302, in _validate_input
    X = self._validate_data(

  File "C:\Users\Irving\PyCharm Projects\Kitten\venv\lib\site-packages\sklearn\base.py", line 577, in _validate_data
    X = check_array(X, input_name="X", **check_params)

  File "C:\Users\Irving\PyCharm Projects\Kitten\venv\lib\site-packages\sklearn\utils\validation.py", line 909, in check_array
    raise ValueError(

ValueError: Found array with 0 sample(s) (shape=(0, 26)) while a minimum of 1 is required by SimpleImputer.

I tried backtracking and tweaking some bits but I have no idea where it goes wrong. It's my first time trying to write a custom transformer, so all help will be greatly appreciated.

Upvotes: 1

Views: 377

Answers (1)

amiola

Reputation: 3026

One possible solution to the problem might be the following:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

df = pd.DataFrame({'Letter': ['A', 'A', 'B', 'B', 'A', 'B'], 
    'Value': [10, 20, np.nan, 1, np.nan, 2]}
)

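class CustomImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Per-letter mean of 'Value'; pandas skips NaN when averaging
        self.mean_value = X.groupby(by='Letter')['Value'].mean()
        return self

    def transform(self, X, y=None):
        X = X.copy()  # work on a copy so the caller's DataFrame is not modified
        # Loop over letters seen in fit, so an unseen letter cannot raise a KeyError
        for letter in self.mean_value.index:
            X.loc[(X['Value'].isna()) & (X['Letter'] == letter), 'Value'] = self.mean_value[letter]
        return X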

imputer = CustomImputer()
imputer.fit_transform(df)

Basically, it computes the mean value per letter in .fit() (without relying on SimpleImputer) and then uses it in .transform() to fill the missing values for that letter.
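If you would rather keep SimpleImputer and pass the grouping column and strategy as parameters, as in the question, something along these lines might work. It is only a sketch: the class name GroupwiseImputer and the parameter name group_col are made up for illustration, and it assumes all non-group columns are numeric and that every group has at least one observed value in each column. One plausible cause of the 0-sample error in the question, though the post does not confirm it, is a value returned by unique() that matches no rows in the frame passed to transform (NaN, for instance, never compares equal to itself). The sketch avoids that by fitting on groupby groups and skipping groups with no rows.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class GroupwiseImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fits one SimpleImputer per value of `group_col`."""

    def __init__(self, group_col, strategy="mean"):
        self.group_col = group_col
        self.strategy = strategy

    def fit(self, X, y=None):
        self.value_cols_ = [c for c in X.columns if c != self.group_col]
        self.imputers_ = {}
        # groupby drops rows whose key is NaN; observed=True skips unused
        # categories, so every group seen here has at least one row.
        for key, group in X.groupby(self.group_col, observed=True):
            imputer = SimpleImputer(strategy=self.strategy)
            self.imputers_[key] = imputer.fit(group[self.value_cols_])
        return self

    def transform(self, X):
        groups = X[self.group_col]
        # Assumption: remaining columns are numeric, so a float cast is safe.
        X_out = X.drop(columns=self.group_col).astype(float)
        for key, imputer in self.imputers_.items():
            mask = (groups == key).to_numpy()
            if mask.any():  # skip groups absent from this frame
                X_out.loc[mask, :] = imputer.transform(X_out.loc[mask, :])
        return X_out


df = pd.DataFrame({
    "Letter": ["A", "A", "B", "B", "A", "B"],
    "Value": [10, 20, np.nan, 1, np.nan, 2],
})

out = GroupwiseImputer(group_col="Letter", strategy="mean").fit_transform(df)
print(out)  # 'Value' becomes 10, 20, 1.5, 1, 15, 2; 'Letter' is dropped
```

Unlike concatenating the per-group results, writing back through a boolean mask keeps the original row order and index. Rows whose group was not seen during fit are left unimputed here; whether that is acceptable depends on your data.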


Upvotes: 1
