lollercoaster

Reputation: 16523

Confusion about Sklearn Pipeline & Feature Union together

I tried to make a toy example to illustrate my confusion. I realize this is a silly example with the iris dataset.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

# 150 examples, 4 features, labels in {0, 1, 2}
iris = load_iris()
y = iris.target
dfX = pd.DataFrame(iris.data, columns=iris.feature_names)

# feature union transformer list 
transformer_list = [
    ('sepal length (cm)', Pipeline([
        ('selector', ItemSelector(key='sepal length (cm)')),
    ])),

    ('sepal width (cm)', Pipeline([
        ('selector', ItemSelector(key='sepal width (cm)')),
    ])),

    ('petal length (cm)', Pipeline([
        ('selector', ItemSelector(key='petal length (cm)')),
    ])),

    ('petal width (cm)', Pipeline([
        ('selector', ItemSelector(key='petal width (cm)')),
    ])),
]

# create pipeline
pipeline = Pipeline([
    ("union", FeatureUnion(transformer_list=transformer_list)),
    ("svm", SVC(kernel="linear")),
])

# train model
param_grid = dict({})
search = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=1)
search.fit(dfX, y)
print(search.best_estimator_)

It errors out on:

/Users/me/.virtualenvs/myenv/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_consistent_length(*arrays)
    179     if len(uniques) > 1:
    180         raise ValueError("Found input variables with inconsistent numbers of"
--> 181                          " samples: %r" % [int(l) for l in lengths])
    182
    183

ValueError: Found input variables with inconsistent numbers of samples: [1, 99]

My thinking was that FeatureUnions are in parallel and Pipelines are serial.

How am I thinking about this wrong? What's the right way to blend the two so I can have rich flows for certain feature types, add transformers piecewise, and still combine everything with a FeatureUnion for the final predictor?

Upvotes: 3

Views: 2904

Answers (1)

Vivek Kumar

Reputation: 36619

Yes, you are correct in thinking:

My thinking was that FeatureUnions are in parallel and Pipelines are serial.

But the problem here is that your ItemSelector returns a numpy array with shape (150,). In the final step of FeatureUnion, where the features from the different transformers are concatenated, numpy.hstack() is used, which stacks the arrays horizontally. Since your output has no second dimension, stacking the four selector outputs produces a single array of shape (600,) instead of a feature matrix. This is why you get the error in the fit method.
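To see where the (600,) shape comes from, here is a minimal sketch (assuming four length-150 columns, as in the iris example) of what the concatenation step effectively does:

import numpy as np

# Each selector returns one column of the DataFrame as a 1D array: shape (150,)
cols = [np.random.rand(150) for _ in range(4)]

# For 1D arrays, hstack concatenates end to end instead of side by side
print(np.hstack(cols).shape)      # (600,)  -- what happens in your pipeline

# With an explicit second dimension, the same call gives a proper feature matrix
cols_2d = [c.reshape(-1, 1) for c in cols]
print(np.hstack(cols_2d).shape)   # (150, 4) -- what the SVC expects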

The correct shape for the final concatenated array is (150, 4), which means each branch must return a 2D column. You should add one more transformer to each inner pipeline that returns the data in the correct shape (or manually reshape the returned data, e.g. with numpy.asmatrix() and transpose()).
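One way to do this is a small reshaping transformer appended after each ItemSelector. This is only a minimal sketch; the ColumnReshaper name is my own and not part of scikit-learn or the question:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class ColumnReshaper(BaseEstimator, TransformerMixin):
    """Reshape a 1D array of shape (n_samples,) into (n_samples, 1)."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.asarray handles a pandas Series as well as a plain array or list
        return np.asarray(X).reshape(-1, 1)

# Each branch of the FeatureUnion then selects a column and reshapes it:
# ('sepal length (cm)', Pipeline([
#     ('selector', ItemSelector(key='sepal length (cm)')),
#     ('reshaper', ColumnReshaper()),
# ])),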

Please take a look at the documentation example of FeatureUnion. There, the output of ItemSelector is passed on to other transformers such as TfidfVectorizer() or DictVectorizer(), which return 2D arrays, so the results are merged correctly in the FeatureUnion. Hope it helps.

Upvotes: 2
