Reputation: 139
I am building a machine learning pipeline using an sklearn Pipeline. In the preprocessing step, I am trying to apply two different treatments to two different string variables: 1) one-hot encoding on BusinessType and 2) mean encoding on AreaCode, as below:
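from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

preprocesses_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        # Branch 1: select BusinessType, then one-hot encode it
        ("text_features1", make_pipeline(
            FunctionTransformer(getBusinessTypeCol, validate=False), CustomOHE()
        )),
        # Branch 2: select AreaCode1 and isFail, then mean-encode AreaCode1
        ("text_features2", make_pipeline(
            FunctionTransformer(getAreaCodeCol, validate=False), MeanEncoding()
        ))
    ])
)
preprocesses_pipeline.fit_transform(trainDF[X_cols])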
With TransformerMixin classes defined as:
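import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanEncoding(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Replace each AreaCode1 value with the mean of isFail for that code
        tmp = X['AreaCode1'].map(X.groupby('AreaCode1')['isFail'].mean())
        return tmp.values

class CustomOHE(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One dummy column per distinct category
        tmp = pd.get_dummies(X)
        return tmp.values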
and the FunctionTransformer functions returning the desired fields:
def getBusinessTypeCol(df):
    return df['BusinessType']

def getAreaCodeCol(df):
    return df[['AreaCode1', 'isFail']]
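To make the mean-encoding expression concrete, here is a minimal sketch on a toy frame. The data values are made up for illustration; only the column names AreaCode1 and isFail come from the code above.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the question.
toy = pd.DataFrame({
    "AreaCode1": ["A", "A", "B", "B", "B", "B"],
    "isFail":    [1,   0,   1,   1,   1,   0],
})

# Mean of the target per area code: A -> 0.5, B -> 0.75
area_means = toy.groupby("AreaCode1")["isFail"].mean()

# Replace each area code with the mean of its group.
encoded = toy["AreaCode1"].map(area_means)

print(encoded.tolist())  # [0.5, 0.5, 0.75, 0.75, 0.75, 0.75]
```

Each row ends up with a single number, the failure rate of its area code, which is what the second branch of the FeatureUnion is meant to contribute.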
Now when I run the above pipeline, it generates the following error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-146-7f3a31a39c81> in <module>()
     15 )
     16
---> 17 preprocesses_pipeline.fit_transform(trainDF[X_cols])

~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    281         Xt, fit_params = self._fit(X, y, **fit_params)
    282         if hasattr(last_step, 'fit_transform'):
--> 283             return last_step.fit_transform(Xt, y, **fit_params)
    284         elif last_step is None:
    285             return Xt

~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    747             Xs = sparse.hstack(Xs).tocsr()
    748         else:
--> 749             Xs = np.hstack(Xs)
    750         return Xs
    751

~\Anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
    286         return _nx.concatenate(arrs, 0)
    287     else:
--> 288         return _nx.concatenate(arrs, 1)
    289
    290

ValueError: all the input arrays must have same number of dimensions
It seems the error occurs on the line containing "MeanEncoding" in the pipeline, as removing it makes the pipeline work fine. I am not sure what exactly is wrong with it. I need help.
Upvotes: 4
Views: 849
Reputation: 139
OK, I solved the puzzle. Basically, MeanEncoding(), after conversion, returns an array of shape (n,), while the concatenation step in FeatureUnion (np.hstack) expects shape (n,1) so that it can combine it with the already processed array of shape (n,k) returned by the first pipeline, CustomOHE(). Since numpy cannot horizontally stack (n,) and (n,k) arrays, the output needs to be reshaped into (n,1). So now my MeanEncoding class looks as follows:
class MeanEncoding(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        tmp = X['AreaCode1'].map(X.groupby('AreaCode1')['isFail'].mean())
        # Return a 2-D column of shape (n, 1) instead of a 1-D array of shape (n,)
        return tmp.values.reshape(len(tmp), 1)
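The shape mismatch can be reproduced with numpy alone. The following is a small sketch in which ohe_out and mean_out are hypothetical stand-ins for the outputs of the two branches, not the real data:

```python
import numpy as np

n, k = 4, 3
ohe_out = np.zeros((n, k))             # stand-in for the CustomOHE output, shape (n, k)
mean_out = np.arange(n, dtype=float)   # stand-in for the mean-encoded values, shape (n,)

# Stacking a 2-D array with a 1-D array fails, as in the traceback.
failed = False
try:
    np.hstack([ohe_out, mean_out])
except ValueError as exc:
    failed = True
    print("hstack failed:", exc)

# Reshaping the 1-D array into a single column makes the stack work.
stacked = np.hstack([ohe_out, mean_out.reshape(-1, 1)])
print(stacked.shape)  # (4, 4)
```

Here reshape(-1, 1) is equivalent to reshape(len(tmp), 1) in the class above; the -1 lets numpy infer the number of rows.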
Upvotes: 3