Shrikar

Reputation: 870

Custom pipeline for different data type in scikit learn

I am trying to predict whether a Kickstarter project will be successful or not, based on several integer features and some text features. I am building a pipeline that looks something like the one in this example:

Reference: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py

Here is my ItemSelector and pipeline code

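from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        # keys: column label(s) to pull out of the DataFrame
        self.keys = keys

    def fit(self, x, y=None):
        # Nothing to learn; selection is stateless
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]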

I verified that the ItemSelector works as expected by running:

t = ItemSelector(['cleaned_text'])
t.transform(df)

It extracts the necessary columns.

Pipeline

pipeline = Pipeline([
    # Use FeatureUnion to combine the text features and the integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for pulling TF-IDF features from the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),

            # Integer features, passed through as selected
            ('integer_features', ItemSelector(int_features)),
        ]
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

But when I run pipeline.fit(X_train, y_train) I receive this error. Any idea how to fix this?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-317e1c402966> in <module>()
----> 1 pipeline.fit(X_train, y_train)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    740         self._update_transformer_list(transformers)
    741         if any(sparse.issparse(f) for f in Xs):
--> 742             Xs = sparse.hstack(Xs).tocsr()
    743         else:
    744             Xs = np.hstack(Xs)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    456 
    457     """
--> 458     return bmat([blocks], format=format, dtype=dtype)
    459 
    460 

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    577                                                     exp=brow_lengths[i],
    578                                                     got=A.shape[0]))
--> 579                     raise ValueError(msg)
    580 
    581                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 81096, expected 1.

Upvotes: 1

Views: 1736

Answers (1)

Vivek Kumar

Reputation: 36619

The ItemSelector returns a DataFrame, not an array. That's why scipy.sparse.hstack throws the error. Change the ItemSelector as below:

class ItemSelector(BaseEstimator, TransformerMixin):    
    ....
    ....
    ....

    def transform(self, data_dict):
        # .values gives the underlying NumPy array (as_matrix() was removed in pandas 1.0)
        return data_dict[self.keys].values

The error occurs in the integer_features part of your pipeline. In the text part, the transformers after the ItemSelector (CountVectorizer and TfidfTransformer) output a sparse matrix, so that part already ends in an array type. The second part has only the ItemSelector, so it returns a DataFrame.
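
One thing worth checking against the traceback, offered as a hedged observation rather than a definitive diagnosis: the message says one block has 1 row while the other has 81096. CountVectorizer iterates over whatever it is given, and iterating over a DataFrame yields its column labels rather than its rows, so selecting with the list ['cleaned_text'] could leave the text part with a single row. The sketch below uses a hypothetical three-row DataFrame (the goal column and the sample texts are made up) to compare selecting with a list against selecting with a plain string.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the real DataFrame.
df = pd.DataFrame({
    "cleaned_text": ["fund my board game", "a new smart watch", "indie film project"],
    "goal": [1000, 50000, 20000],
})

# Selecting with a list gives a one-column DataFrame. Iterating over a
# DataFrame yields its column labels, so the vectorizer sees one "document".
as_frame = CountVectorizer().fit_transform(df[["cleaned_text"]])

# Selecting with a plain string gives a Series, which iterates over the rows.
as_series = CountVectorizer().fit_transform(df["cleaned_text"])

# Compare the number of rows each selection produces.
print(as_frame.shape[0], as_series.shape[0])
```

If the row counts differ like this on the real data, constructing the selector as ItemSelector('cleaned_text') would be one way to make the text part produce one row per project.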

Update:

In the comments, you mentioned that you want to perform some further operations on the DataFrame returned by the ItemSelector. So instead of modifying the transform method of the ItemSelector, you can write a new transformer and append it to the second part of your pipeline.

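class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # No parameters to store
        pass

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        # .values gives the underlying NumPy array (as_matrix() was removed in pandas 1.0)
        return X.values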

Then your pipeline should look like this:

pipeline = Pipeline([
    # Use FeatureUnion to combine the text features and the integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for pulling TF-IDF features from the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),

            # Pipeline for selecting the integer columns and converting them to an array
            ('integer', Pipeline([
                ('integer_features', ItemSelector(int_features)),
                ('array', DataFrameToArrayTransformer()),
            ])),
        ]
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])
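
For anyone who wants to try this structure end to end, here is a minimal sketch on made-up data. The sample texts, the goal and backers columns, and the labels are all hypothetical. It also makes one assumption that differs from the code above: the text selector is given a string key rather than a one-element list, so it returns a Series and CountVectorizer sees one document per row.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]


class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, X):
        # Convert the selected DataFrame to a plain NumPy array
        return X.values


# Hypothetical toy data; every value here is invented for illustration.
X_train = pd.DataFrame({
    "cleaned_text": ["fund my board game", "a new smart watch",
                     "indie film project", "open source robot kit"],
    "goal": [1000, 50000, 20000, 7500],
    "backers": [40, 900, 15, 300],
})
y_train = [1, 1, 0, 1]
int_features = ["goal", "backers"]

pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("text", Pipeline([
            # Assumption: a string key, so the selector returns a Series
            ("selector", ItemSelector("cleaned_text")),
            ("counts", CountVectorizer()),
            ("tf_idf", TfidfTransformer()),
        ])),
        ("integer", Pipeline([
            ("integer_features", ItemSelector(int_features)),
            ("array", DataFrameToArrayTransformer()),
        ])),
    ])),
    ("svc", SVC(kernel="linear")),
])

pipeline.fit(X_train, y_train)

# Inspect the combined feature matrix produced by the union step
features = pipeline.named_steps["union"].transform(X_train)
predictions = pipeline.predict(X_train)
print(features.shape, predictions)
```

Both parts of the union produce one row per input row here, which is what lets the horizontal stacking succeed.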

The main thing to understand here is that FeatureUnion stacks the outputs of its transformers side by side, so each output should be a 2-D array (or sparse matrix) with the same number of rows. Any other type, such as a DataFrame, may cause a problem there.
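
As the traceback in the question shows, FeatureUnion calls sparse.hstack when any block is sparse. A rough sketch of that stacking step, outside of scikit-learn and with made-up blocks (the names text_block, int_block and one_row_block are hypothetical), shows both the working case and the row mismatch:

```python
import numpy as np
from scipy import sparse

n_rows = 4

# Stand-in for the sparse TF-IDF output of the text part (4 rows, 6 columns)
text_block = sparse.csr_matrix(np.eye(n_rows, 6))

# Stand-in for the integer features (4 rows, 2 columns), wrapped as sparse
# here only to keep the sketch simple
int_block = sparse.csr_matrix(np.arange(n_rows * 2).reshape(n_rows, 2))

# Matching row counts: the blocks are joined column-wise
combined = sparse.hstack([text_block, int_block]).tocsr()
print(combined.shape)

# A block with a different number of rows cannot be stacked
one_row_block = sparse.csr_matrix(np.ones((1, 6)))
try:
    sparse.hstack([one_row_block, int_block])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)
```

The second call fails with a ValueError about row dimensions, which is the same kind of failure reported in the question.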

Upvotes: 1
