Reputation: 1099
I'm trying to build a pipeline containing several user-defined column
transformations. Each new column transformer inherits from
sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and
implements the fit and transform methods. Calling the transformers
directly works as expected, but using them as part of a
sklearn.pipeline.Pipeline instance fails with confusing errors.
Let's say I have a pandas.DataFrame instance df containing the following data:
       date   genre
0   9/22/11  horror
1   1/16/04    NULL
2  10/11/96    NULL
3   3/28/13   drama
4   4/22/94   drama
I want to implement two transformers:
DateTransformer, which converts the date strings in df['date'] into a
numpy.array instance containing the year, month, and day for every row.
GenreTransformer, which, for every genre in df['genre'], returns 1 if
it is not specified ('NULL') and -1 otherwise.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GenreTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Map every specified genre to -1 and every 'NULL' to 1.
        x_copy = x.copy()
        x_copy[x_copy != 'NULL'] = -1
        x_copy[x_copy == 'NULL'] = 1
        return x_copy.values

class DateTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Parse each date string, then split it into year, month, and day.
        x_timestamp = x.apply(pd.to_datetime)
        return np.column_stack((
            x_timestamp.apply(lambda t: t.year).values,
            x_timestamp.apply(lambda t: t.month).values,
            x_timestamp.apply(lambda t: t.day).values,
        ))
Both transformers work correctly:
>>> GenreTransformer().fit_transform(df['genre'])
array([-1, 1, 1, -1, -1])
>>> DateTransformer().fit_transform(df['date'])
array([[2011,    9,   22],
       [2004,    1,   16],
       [1996,   10,   11],
       [2013,    3,   28],
       [1994,    4,   22]])
However, when I combine the transformers using
sklearn.compose.ColumnTransformer and create a pipeline,
DateTransformer doesn't work:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

column_transformer = ColumnTransformer(
    transformers=[
        ('date_trans', DateTransformer(), ['date']),
        ('genre_trans', GenreTransformer(), ['genre']),
    ],
    remainder='drop',
)
pipe = Pipeline(
    steps=[
        ('union', column_transformer),
        # estimators
    ],
)
>>> pipe.fit(df)
---------------------------------------------------------------------------
Traceback (most recent call last)
...
AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index date')
Interestingly, using pandas.Series.apply instead of boolean masks
inside GenreTransformer.transform and then fitting the pipe also fails:
class GenreTransformer(BaseEstimator, TransformerMixin):
    # ...
    def transform(self, x):
        return x.apply(lambda g: -1 if g != 'NULL' else 1)
>>> pipe.fit(df)
---------------------------------------------------------------------------
Traceback (most recent call last)
...
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index genre')
So I suspect that something goes wrong when the pandas.Series.apply
method is used inside pipelines. Could this be a bug in the scikit-learn
source code, or am I doing something incorrectly? If so, could you please
point out how to implement column transformers so that I can include them
in pipelines?
Upvotes: 2
Views: 5122
Reputation: 19905
There is a subtle mistake in your code.
You specified ['date'] as the columns to apply DateTransformer to.
Passing a list signifies that DateTransformer expects a 2D array-like,
which, in this case, is a DataFrame. However, your transform method
actually expects a 1D array-like, i.e. a Series.
Therefore, what you did was equivalent to
DateTransformer().fit_transform(df[['date']]), when you actually wanted
df['date']. On a DataFrame, apply passes each whole column to the
function as a Series, which is why the lambda fails with
"'Series' object has no attribute 'year'".
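To make the difference concrete, here is a minimal sketch (assuming a df rebuilt from the sample data in the question) that records what kind of object the function passed to apply receives in each case. The variable names one_d, two_d, series_arg_types, and frame_arg_types are just illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "date": ["9/22/11", "1/16/04", "10/11/96", "3/28/13", "4/22/94"],
    "genre": ["horror", "NULL", "NULL", "drama", "drama"],
})

one_d = df["date"]    # Series: what a scalar column spec selects
two_d = df[["date"]]  # one-column DataFrame: what a list column spec selects

# Series.apply calls the function once per element (here, each date string).
series_arg_types = set(one_d.apply(lambda v: type(v).__name__))

# DataFrame.apply calls the function once per column, passing a whole Series.
frame_arg_types = set(two_d.apply(lambda col: type(col).__name__))

print(type(one_d).__name__, one_d.shape, series_arg_types)
print(type(two_d).__name__, two_d.shape, frame_arg_types)
```

The same mechanism would explain the "truth value of a Series is ambiguous" error in the apply-based GenreTransformer: the lambda compares a whole column to 'NULL' and then tries to use the result in an if expression.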
Accordingly, pass ('date_trans', DateTransformer(), 'date') to
ColumnTransformer instead and everything should be fine.
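Putting it together, a sketch of the corrected setup might look like the following. DateTransformer is copied unchanged from the question. GenreTransformer is an assumed variant that uses np.where instead of the question's mask assignment, purely to keep the output numeric. It keeps the list spec ['genre'], so it receives a one-column DataFrame and returns a 2D result. The df is rebuilt from the sample data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


class DateTransformer(BaseEstimator, TransformerMixin):
    """Unchanged from the question: expects a 1D Series of date strings."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x_timestamp = x.apply(pd.to_datetime)
        return np.column_stack((
            x_timestamp.apply(lambda t: t.year).values,
            x_timestamp.apply(lambda t: t.month).values,
            x_timestamp.apply(lambda t: t.day).values,
        ))


class GenreTransformer(BaseEstimator, TransformerMixin):
    """Assumed variant: np.where replaces the question's mask assignment."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is a one-column DataFrame here, so the result has shape (n, 1).
        return np.where(x == "NULL", 1, -1)


# Sample data rebuilt from the question.
df = pd.DataFrame({
    "date": ["9/22/11", "1/16/04", "10/11/96", "3/28/13", "4/22/94"],
    "genre": ["horror", "NULL", "NULL", "drama", "drama"],
})

column_transformer = ColumnTransformer(
    transformers=[
        ("date_trans", DateTransformer(), "date"),       # scalar -> Series
        ("genre_trans", GenreTransformer(), ["genre"]),  # list -> DataFrame
    ],
    remainder="drop",
)
pipe = Pipeline(steps=[("union", column_transformer)])

result = pipe.fit_transform(df)
print(result)
```

The first three columns of result hold year, month, and day, and the last column holds the genre flag. Note that the genre spec stays a list here: as far as I know, ColumnTransformer expects each transformer to return a 2D result, so a transformer that is handed a Series must still produce 2D output, as DateTransformer does via np.column_stack.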
Upvotes: 4