Marcel

Reputation: 223

Sklearn's SimpleImputer doesn't work in a pipeline?

I have a pandas dataframe that has some NaN values in a particular column:

1291   NaN
1841   NaN
2049   NaN
Name: some column, dtype: float64

And I have made the following pipeline in order to deal with it:

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()

# Impute first, then scale, then fit the classifier.
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])

Now when I pass this pipeline to a RandomizedSearchCV, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The full traceback is quite a bit longer than that; I can post it in an edit if necessary. I am quite sure that this column is the only one that contains NaNs. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer, the pipeline works just fine in my RandomizedSearchCV. I checked the documentation, and SimpleImputer seems to be meant to behave in (nearly) the same way as Imputer. What is the difference in behavior? How do I use an imputer in my pipeline without using the deprecated Imputer?

Upvotes: 6

Views: 7758

Answers (2)

hanzgs

Reputation: 1616

SimpleImputer in make_pipeline

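from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnSelector is not part of scikit-learn; it is assumed to be a
# user-defined transformer that returns the named DataFrame columns.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            # On scikit-learn >= 1.2 this argument is named sparse_output.
            OneHotEncoder(sparse=False)
        ))
    ])
)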

SimpleImputer in Pipeline

('features', FeatureUnion([
    ('Numerics', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))
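
The snippets above rely on ColumnSelector and TypeSelector, which are not scikit-learn classes. As a rough, self-contained alternative, the same idea can be sketched with scikit-learn's ColumnTransformer instead of FeatureUnion plus a custom selector. The toy DataFrame and its values are made up for illustration, and this is only one possible arrangement, not necessarily what the answer's author used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names echo the first snippet above.
df = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 20.0],
    "Type": pd.Series(["a", "b", np.nan, "a"], dtype=object),
})

# Numeric branch: fill NaN with 0, then standardize.
numeric = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler(),
)

# Categorical branch: fill NaN with a placeholder label, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing_value"),
    OneHotEncoder(handle_unknown="ignore"),
)

# ColumnTransformer routes each column list to its own branch.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [("numeric", numeric, ["Amount"]),
     ("categorical", categorical, ["Type"])],
    sparse_threshold=0,
)

features = preprocess.fit_transform(df)
print(features.shape)
```

The resulting preprocess object can be used as the first step of a Pipeline, with the estimator as the last step.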

Upvotes: 1

K.K.

Reputation: 117

I had the same issue but this addressed it:

imputer = SimpleImputer(strategy = 'median', fill_value = 0)
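
One caveat, going by the scikit-learn documentation: fill_value is only used when strategy='constant', so with strategy='median' the line above should still impute the column median. The sketch below checks that on a tiny array and then runs the question's pipeline inside RandomizedSearchCV on synthetic data. The data, the parameter grid, and the search settings are all assumptions made for illustration. Since the error message also mentions infinity, and SimpleImputer only replaces its missing_values marker (NaN by default), the sketch also counts infinite entries as a sanity check.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# With strategy='median', fill_value is not used: the NaN becomes the
# median of the observed values (1, 3, 100), which is 3.
col = np.array([[1.0], [np.nan], [3.0], [100.0]])
filled = SimpleImputer(strategy="median", fill_value=0).fit_transform(col)

# Synthetic data (an assumption): three features, NaNs only in the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(200) < 0.1, 2] = np.nan

# Sanity check: infinite values are not touched by the imputer and
# would still trigger the same ValueError further down the pipeline.
n_inf = int(np.isinf(X).sum())

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

# Hypothetical search space over the regularization strength C.
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

If a sketch like this runs cleanly but the real data still fails, counting NaN and infinite values per column in the real DataFrame is a reasonable next step.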

Upvotes: 0
