FarrukhJ

Reputation: 139

"ValueError: all the input arrays must have same number of dimensions" Error in sklearn pipeline

I am building a machine learning pipeline using sklearn's Pipeline. In the preprocessing step, I am trying to apply two different treatments to two different string variables: 1) one-hot encoding on BusinessType and 2) mean encoding on AreaCode, as below:

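import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

preprocesses_pipeline = make_pipeline (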
    FeatureUnion (transformer_list = [
        ("text_features1",  make_pipeline(
            FunctionTransformer(getBusinessTypeCol, validate=False), CustomOHE()
        )),
        ("text_features2",  make_pipeline(
            FunctionTransformer(getAreaCodeCol, validate=False), MeanEncoding()
        ))
    ])
)

preprocesses_pipeline.fit_transform(trainDF[X_cols])

With TransformerMixin classes defined as:

class MeanEncoding(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        tmp = X['AreaCode1'].map(X.groupby('AreaCode1')['isFail'].mean())
        return tmp.values

class CustomOHE(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        tmp = pd.get_dummies(X)
        return tmp.values

and the FunctionTransformer functions returning the desired fields:

def getBusinessTypeCol(df):
    return df['BusinessType']

def getAreaCodeCol(df):
    return df[['AreaCode1','isFail']]

Now when I run the above pipeline, it generates the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-146-7f3a31a39c81> in <module>()
     15 )
     16 
---> 17 preprocesses_pipeline.fit_transform(trainDF[X_cols])

~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    281         Xt, fit_params = self._fit(X, y, **fit_params)
    282         if hasattr(last_step, 'fit_transform'):
--> 283             return last_step.fit_transform(Xt, y, **fit_params)
    284         elif last_step is None:
    285             return Xt

~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    747             Xs = sparse.hstack(Xs).tocsr()
    748         else:
--> 749             Xs = np.hstack(Xs)
    750         return Xs
    751 

~\Anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
    286         return _nx.concatenate(arrs, 0)
    287     else:
--> 288         return _nx.concatenate(arrs, 1)
    289 
    290 

ValueError: all the input arrays must have same number of dimensions

The error seems to come from the line with "MeanEncoding" in the pipeline, since removing it makes the pipeline work fine. I am not sure what exactly is wrong with it. Any help is appreciated.

Upvotes: 4

Views: 849

Answers (1)

FarrukhJ

Reputation: 139

OK, I solved the puzzle. MeanEncoding().transform returns a 1-D array of shape (n,), while FeatureUnion expects a 2-D array of shape (n,1) so that it can stack it horizontally with the (n,k) array returned by the first pipeline, CustomOHE(). Since numpy cannot horizontally stack (n,) with (n,k), the output needs to be reshaped to (n,1). My MeanEncoding class now looks as follows:

class MeanEncoding(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        tmp = X['AreaCode1'].map(X.groupby('AreaCode1')['isFail'].mean())
        return tmp.values.reshape(len(tmp), 1)
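
The shape mismatch can be reproduced outside the pipeline. This is only a minimal sketch with made-up arrays: ohe_out and mean_out are hypothetical stand-ins for the outputs of CustomOHE() and the original MeanEncoding(), not the real data. It stacks them with np.hstack, the same call that appears in the traceback.

```python
import numpy as np

n, k = 4, 3
# Hypothetical stand-ins, not the real pipeline outputs:
ohe_out = np.zeros((n, k))       # 2-D, shape (n, k), like CustomOHE
mean_out = np.arange(n) / 10.0   # 1-D, shape (n,), like the original MeanEncoding

# Stacking a 2-D array with a 1-D array fails with a ValueError.
try:
    np.hstack([ohe_out, mean_out])
    stacked_ok = True
except ValueError:
    stacked_ok = False

# After reshaping to (n, 1), both inputs are 2-D and the stack succeeds.
fixed = np.hstack([ohe_out, mean_out.reshape(-1, 1)])
print(stacked_ok, fixed.shape)
```

Here reshape(-1, 1) is equivalent to reshape(len(tmp), 1) in the class above; numpy infers the row count from the array length.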

Upvotes: 3
