Problem with column transformations inside pipeline

Question

I'm trying to build a pipeline containing several user-defined column transformations. Each transformer inherits from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and implements the fit and transform methods. Calling the transformers directly works as expected, but using them as part of a sklearn.pipeline.Pipeline instance fails with confusing errors.

Let's say I have a pandas.DataFrame instance df containing the following data:

       date   genre
0   9/22/11  horror
1   1/16/04    NULL
2  10/11/96    NULL
3   3/28/13   drama
4   4/22/94   drama

I want to implement two transformers:

DateTransformer, which converts date strings in df['date'] into a numpy.array instance containing year, month, and day for every row.
GenreTransformer, which for every genre in df['genre'], returns 1 if it is not specified ('NULL'), and -1 otherwise.

Here is my code:

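import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

class GenreTransformer(BaseEstimator, TransformerMixin):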
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x_copy = x.copy()
        x_copy[x_copy != 'NULL'] = -1
        x_copy[x_copy == 'NULL'] = 1
        return x_copy.values

class DateTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x_timestamp = x.apply(pd.to_datetime)
        return np.column_stack((
            x_timestamp.apply(lambda t: t.year).values,
            x_timestamp.apply(lambda t: t.month).values,
            x_timestamp.apply(lambda t: t.day).values,
        ))

Both transformers work correctly:

>>> GenreTransformer().fit_transform(df['genre'])
array([-1, 1, 1, -1, -1])

>>> DateTransformer().fit_transform(df['date'])
array([[2011,    9,   22],
       [2004,    1,   16],
       [1996,   10,   11],
       [2013,    3,   28],
       [1994,    4,   22]])

However, when I merge the transformers using sklearn.compose.ColumnTransformer, and create a pipeline, DateTransformer doesn't work:

column_transformer = ColumnTransformer(
    transformers=[
        ('date_trans', DateTransformer(), ['date']),
        ('genre_trans', GenreTransformer(), ['genre']),
    ],
    remainder='drop',
)

pipe = Pipeline(
    steps=[
        ('union', column_transformer),
        # estimators
    ],
)

>>> pipe.fit(df)
---------------------------------------------------------------------------
Traceback (most recent call last)
...
AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index date')

Interestingly, using pandas.Series.apply instead of mask methods inside GenreTransformer.transform and fitting the pipe also fails:

class GenreTransformer(BaseEstimator, TransformerMixin):
    # ...
    def transform(self, x):
        return x.apply(lambda g: -1 if g != 'NULL' else 1)

>>> pipe.fit(df)
---------------------------------------------------------------------------
Traceback (most recent call last)
...
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index genre')

So, I guess there is something wrong with using the pandas.Series.apply method inside pipelines. Could this be a bug in the scikit-learn source code, or am I doing something incorrectly? If so, can you please point out how to implement column transformers so that I can include them in pipelines?

gmds · Accepted Answer

There is a subtle mistake in your code.

You specified ['date'] (a list) as the columns to apply DateTransformer to. A list selector tells ColumnTransformer that the transformer expects a 2D array-like, which in this case is a one-column DataFrame. However, DateTransformer.transform is actually written for a 1D array-like, that is, a Series.

Therefore, what you did was equivalent to DateTransformer().fit_transform(df[['date']]), when you actually wanted df['date'].
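To make the distinction concrete, here is a minimal sketch (the two-row frame and the helper name kind are illustrative, not from the question) showing that a string selector yields a Series while a list selector yields a DataFrame, and that apply behaves differently on each:

```python
import pandas as pd

# Two rows from the question's data are enough to show the difference.
df = pd.DataFrame({"date": ["9/22/11", "1/16/04"], "genre": ["horror", "NULL"]})

as_series = df["date"]    # string selector -> 1D Series
as_frame = df[["date"]]   # list selector   -> 2D one-column DataFrame
print(type(as_series).__name__, as_series.ndim)
print(type(as_frame).__name__, as_frame.ndim)

def kind(value):
    """Return the type name of whatever apply hands to the function."""
    return type(value).__name__

# Series.apply calls the function once per element (here: each date string).
element_kinds = set(as_series.apply(kind))
# DataFrame.apply calls the function once per column, passing a whole Series.
column_kinds = set(as_frame.apply(kind))
print(element_kinds, column_kinds)
```

This would explain both errors in the question: inside the pipeline, the lambdas receive an entire column as a Series, so t.year raises AttributeError and using g != 'NULL' as a condition raises the "truth value of a Series is ambiguous" ValueError.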

Accordingly, pass the column name as a plain string, ('date_trans', DateTransformer(), 'date'), to ColumnTransformer instead, so that DateTransformer receives a Series, and everything should be fine.
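Put together, the fix might look like the sketch below. It is not the asker's exact code: the explicit format="%m/%d/%y" argument, the .dt accessor, and np.where are assumptions swapped in for the original apply and mask logic to keep the example short. GenreTransformer keeps the list selector ['genre'] so that it still returns a 2D array.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

class DateTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is a 1D Series because the column is selected with a string.
        # The explicit format is an assumption about the date layout.
        ts = pd.to_datetime(x, format="%m/%d/%y")
        return np.column_stack((
            ts.dt.year.to_numpy(),
            ts.dt.month.to_numpy(),
            ts.dt.day.to_numpy(),
        ))

class GenreTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is a one-column DataFrame here, so the result has shape (n, 1).
        return np.where(x == "NULL", 1, -1)

df = pd.DataFrame({
    "date": ["9/22/11", "1/16/04", "10/11/96", "3/28/13", "4/22/94"],
    "genre": ["horror", "NULL", "NULL", "drama", "drama"],
})

column_transformer = ColumnTransformer(
    transformers=[
        ("date_trans", DateTransformer(), "date"),       # string -> Series
        ("genre_trans", GenreTransformer(), ["genre"]),  # list -> DataFrame
    ],
    remainder="drop",
)

pipe = Pipeline(steps=[("union", column_transformer)])
result = pipe.fit_transform(df)
print(result)
```

The result should be a single array with four columns per row: year, month, day, and the genre flag.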
