DN1

Reputation: 218

Does SimpleImputer remove features?

I have a dataset with 284 features that I am trying to impute using scikit-learn. However, I get an error saying the number of features has changed to 283:

imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(data.iloc[:,0:284])
df[:,0:284] = imputer.transform(df[:,0:284])
X = MinMaxScaler().fit_transform(df)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-150-849be5be8fcb> in <module>
      1 imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
      2 imputer = imputer.fit(data.iloc[:,0:284])
----> 3 df[:,0:284] = imputer.transform(df[:,0:284])
      4 X = MinMaxScaler().fit_transform(df)

~\Anaconda3\envs\environment\lib\site-packages\sklearn\impute\_base.py in transform(self, X)
    411         if X.shape[1] != statistics.shape[0]:
    412             raise ValueError("X has %d features per sample, expected %d"
--> 413                              % (X.shape[1], self.statistics_.shape[0]))
    414 
    415         # Delete the invalid columns if strategy is not constant

ValueError: X has 283 features per sample, expected 284

I don't understand how this ends up with 283 features. I assume that during fitting it finds features that are all 0s or something similar and decides to drop them, but I can't find documentation that tells me how to make sure those features are kept. I am not a programmer, so I'm not sure whether I am missing something obvious or whether I'd be better off looking into another method.

Upvotes: 4

Views: 2662

Answers (2)

luisvenezian

Reputation: 501

I was dealing with the same situation and solved it by adding this transformation before the SimpleImputer with the mean strategy:

imputer = SimpleImputer(strategy = 'constant', fill_value = 0)
df_prepared_to_mean_or_anything_else = imputer.fit_transform(previous_df)

What does it do? It fills every missing value with the value given in the fill_value parameter.
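A minimal sketch of a variation on this idea, assuming a small made-up array: instead of zero-filling every missing value, it fills only the columns that have no observed values at all and then applies mean imputation, so the column count stays the same. The names X, empty_cols and X_filled are illustrative and not from the original post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data, made up for illustration: column 1 has no observed values.
X = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [7.0, np.nan, np.nan],
])

# Boolean mask of columns that consist entirely of NaN.
empty_cols = np.isnan(X).all(axis=0)

# Fill only those columns with 0 so the mean strategy has a value to work with.
X_filled = X.copy()
X_filled[:, empty_cols] = 0.0

# Every column now has at least one observed value, so none is discarded.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_filled)
print(X_imputed.shape)
```

The other columns still get their own mean, which would be lost if everything were filled with 0 first.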

Upvotes: 0

gil

Reputation: 106

This could happen if you have a feature without any values. From https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html: 'Columns which only contained missing values at fit are discarded upon transform if strategy is not “constant”'. You can check whether this is indeed the problem by passing a high 'verbose' value when constructing the imputer:

sklearn.impute.SimpleImputer(..., verbose=100,...)

It will print something like: UserWarning: Deleting features without observed values: [ ... ]
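As far as I know, the verbose parameter has been deprecated and removed in newer scikit-learn releases, so here is a rough sketch of a check that does not rely on it. It uses a made-up toy array (X is illustrative, not the asker's data) to show a column without observed values being dropped by the mean strategy, and how to locate such columns with NumPy beforehand.

```python
import warnings

import numpy as np
from sklearn.impute import SimpleImputer

# Toy data, made up for illustration: column 1 contains only missing values.
X = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [7.0, np.nan, np.nan],
])

# Indices of columns that are entirely NaN; these are the ones at risk.
empty_cols = np.flatnonzero(np.isnan(X).all(axis=0))
print(empty_cols)

with warnings.catch_warnings():
    # Some versions emit a warning about features without observed values.
    warnings.simplefilter("ignore")
    imputer = SimpleImputer(strategy="mean").fit(X)
    out = imputer.transform(X)

# The fitted mean for an all-NaN column is NaN, and that column is
# missing from the transformed output.
print(imputer.statistics_)
print(out.shape)
```

Recent releases also document a keep_empty_features option on SimpleImputer; whether it exists depends on your installed version, so check its documentation before relying on it.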

Upvotes: 9
