lollercoaster

Reputation: 16523

Confusion about Sklearn Pipeline & Feature Union together

I tried to make a toy example to illustrate my confusion. I realize this is a silly example with the iris dataset.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

# 150 examples, 4 features, labels in {0, 1, 2}
iris = load_iris()
y = iris.target
dfX = pd.DataFrame(iris.data, columns=iris.feature_names)

# feature union transformer list 
transformer_list = [
    ('sepal length (cm)', Pipeline([
        ('selector', ItemSelector(key='sepal length (cm)')),
    ])),

    ('sepal width (cm)', Pipeline([
        ('selector', ItemSelector(key='sepal width (cm)')),
    ])),

    ('petal length (cm)', Pipeline([
        ('selector', ItemSelector(key='petal length (cm)')),
    ])),

    ('petal width (cm)', Pipeline([
        ('selector', ItemSelector(key='petal width (cm)')),
    ])),
]

# create pipeline
pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=transformer_list)),
    ("svm", SVC(kernel="linear")),
])

# train model
param_grid = dict({})
search = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=1)
search.fit(dfX, y)
print(search.best_estimator_)

It errors out on:

/Users/me/.virtualenvs/myenv/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_consistent_length(*arrays)
    179     if len(uniques) > 1:
    180         raise ValueError("Found input variables with inconsistent numbers of"
--> 181                          " samples: %r" % [int(l) for l in lengths])
    182
    183

ValueError: Found input variables with inconsistent numbers of samples: [1, 99]

My thinking was that FeatureUnions are in parallel and Pipelines are serial.

How am I thinking about this wrong? What's the right way to blend the two so I can have rich flows for certain feature types, add transformers piecewise, and still combine everything with a FeatureUnion for the final predictor?

Upvotes: 3

Views: 2904

Answers (1)

Vivek Kumar

Reputation: 36619

Yes, you are correct in thinking:

My thinking was that FeatureUnions are in parallel and Pipelines are serial.

But the problem here is that your ItemSelector returns a numpy array with shape (150,). In the final step of FeatureUnion, where the features from the different transformers are concatenated, numpy.hstack() is used, which stacks the arrays horizontally. Since your output has no second dimension, stacking the four selector outputs produces a single array of shape (600,) instead of a feature matrix. This is why you get the error in the fit method.
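To see where the (600,) shape comes from, here is a minimal sketch (assuming four length-150 columns, as in the iris example) of what the concatenation step effectively does:

import numpy as np

# Each selector returns one column of the DataFrame as a 1D array: shape (150,)
cols = [np.random.rand(150) for _ in range(4)]

# For 1D arrays, hstack concatenates end to end instead of side by side
print(np.hstack(cols).shape)      # (600,)  -- what happens in your pipeline

# With an explicit second dimension, the same call gives a proper feature matrix
cols_2d = [c.reshape(-1, 1) for c in cols]
print(np.hstack(cols_2d).shape)   # (150, 4) -- what the SVC expects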

The correct shape for the final concatenated array is (150, 4), which means each branch must return a 2D column. You should add one more transformer to each inner pipeline that returns the data in the correct shape (or manually reshape the returned data, e.g. with numpy.asmatrix() and transpose()).
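One way to do this is a small reshaping transformer appended after each ItemSelector. This is only a minimal sketch; the ColumnReshaper name is my own and not part of scikit-learn or the question:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class ColumnReshaper(BaseEstimator, TransformerMixin):
    """Reshape a 1D array of shape (n_samples,) into (n_samples, 1)."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.asarray handles a pandas Series as well as a plain array or list
        return np.asarray(X).reshape(-1, 1)

# Each branch of the FeatureUnion then selects a column and reshapes it:
# ('sepal length (cm)', Pipeline([
#     ('selector', ItemSelector(key='sepal length (cm)')),
#     ('reshaper', ColumnReshaper()),
# ])),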

Please take a look at the documentation example of FeatureUnion. There, the output of ItemSelector is passed on to other transformers such as TfidfVectorizer() or DictVectorizer(), which return 2D arrays, so the results are merged correctly in the FeatureUnion. Hope it helps.

Upvotes: 2
