dg S

Reputation: 85

Custom preprocessor in Sklearn pipeline

I am building a machine learning model pipeline. I have a custom function that changes the values of a specific column. I have defined a custom transformer for it, and it works fine when I call it on its own, but when I call it from a pipeline it throws an error.

Sample Dataframe

df = pd.DataFrame({'y': [4,5,6], 'a':[3,2,3], 'b' : [2,3,4]})

import numpy as np
import pandas as pd
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
class Extractor(BaseEstimator, TransformerMixin):
  def __init__(self):
    return None
  def fit(self, x, y=None):
    return self
  def map_values(self, x):
    if x in [1.0,2.0,3.0]:
      return "Class A"
    if x in [4.0,5.0,6.0]:
      return "Class B"
    if x in [7.0,8.0]:
      return "Class C"
    if x in [9.0,10.0]:
      return "Class D"
    else:
      return "Other"
  def transform(self, X):
    return self
  def fit_transform(self, X):
    X = X.copy()
    X = X.apply(lambda x : self.map_values(x))
    return X

e = Extractor()
e.fit_transform(df['a'])
0    Class A
1    Class C
2      Other
3    Class B
Name: a, dtype: object

Pipeline

features = ['a']
numeric_features=['b']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])
custom_transformer = Pipeline(steps=[
    ('map_value', Extractor())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('time',custom_transformer, features)])

X_new = df[['a','b']]
y_new = df['y']

X_transform = preprocessor.fit_transform(X_new,y_new)

TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'Pipeline(steps=[('map_value', Extractor())])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't.

I want to make the custom preprocessor work inside the pipeline.

Upvotes: 0

Views: 2719

Answers (1)

Kim Tang

Reputation: 2478

I tried working with your code and found some issues. Below is the updated code with some remarks.

First of all, after copying your code and adding the missing import for SimpleImputer, I could not reproduce your error. Instead, it raised: "TypeError: fit_transform() takes 2 positional arguments but 3 were given". After some research, I found a fix (linked in the code comments below) and adjusted your method accordingly.
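To illustrate that first error: Pipeline and ColumnTransformer call fit_transform(X, y), so an override that only accepts X fails. Here is a minimal sketch; the class names NoOverride and NarrowOverride are made up for this example and are not part of your code. Leaving the override out lets TransformerMixin supply a fit_transform that accepts y.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class NoOverride(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that relies on TransformerMixin.fit_transform."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Toy transformation, only here to have something to return
        return [v * 2 for v in X]


class NarrowOverride(NoOverride):
    """Hypothetical transformer whose fit_transform does not accept y."""

    def fit_transform(self, X):
        return self.transform(X)


# Inherited fit_transform accepts X and y, then calls fit(X, y).transform(X)
ok = NoOverride().fit_transform([1, 2, 3], [0, 1, 0])

# The narrow override cannot take the extra y argument
try:
    NarrowOverride().fit_transform([1, 2, 3], [0, 1, 0])
    narrow_error = None
except TypeError as exc:
    narrow_error = str(exc)

print(ok)
print(narrow_error)
```

So either give fit_transform a y parameter, or drop the override and implement the logic in transform.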

After that change, it raised a different error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

The problem is that your Extractor expects a pandas Series in which each entry is a number, so that each entry can be mapped to one of your classes. In other words, the input has to be one-dimensional, like a list. This works well with df['a'], which is basically [3,2,3].

But when you try to use df[['a','b']] with it, you pass two columns, which means there are two lists: [3,2,3] for a and [2,3,4] for b.
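The difference can be reproduced outside the pipeline. This is a small sketch with a shortened, hypothetical map_value function (only one class, to keep it brief). As far as I know, ColumnTransformer hands a 2D DataFrame to the transformer whenever the column selector is a list such as ['a'], even for a single column, and DataFrame.apply then passes a whole column (a Series) to the function instead of a single number.

```python
import pandas as pd

df = pd.DataFrame({'y': [4, 5, 6], 'a': [3, 2, 3], 'b': [2, 3, 4]})

series = df['a']     # one-dimensional Series
frame = df[['a']]    # two-dimensional DataFrame with a single column

print(series.shape)
print(frame.shape)


def map_value(x):
    # Shortened, hypothetical version of Extractor.map_values
    return "Class A" if x in [1.0, 2.0, 3.0] else "Other"


# Series.apply passes one number at a time, so the membership test works
series_result = series.apply(map_value)

# DataFrame.apply passes a whole column, so `x in [...]` compares a Series
# with each list element and asks for its truth value
try:
    frame.apply(map_value)
    frame_error = None
except ValueError as exc:
    frame_error = str(exc)

print(series_result.tolist())
print(frame_error)
```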

So here you need to decide what you actually want your Extractor to do. My first thought was to put a and b into a single list, so that it forms [3,2,3,2,3,4], but then you end up with 6 class labels, which does not match your three y entries.

Therefore, I believe you want to implement a method that takes a list of classes per row and, for example, picks the most frequent class.

For example, you would need to map a[0] and b[0] to y[0], so that Class A and Class A together correspond to 4 (the value of y[0]).

import numpy as np
import pandas as pd
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Added import
from sklearn.impute import SimpleImputer

class Extractor(BaseEstimator, TransformerMixin):
  def __init__(self):
    return None
  def fit(self, x, y=None):
    return self
  def map_values(self, x):
    if x in [1.0,2.0,3.0]:
      return "Class A"
    if x in [4.0,5.0,6.0]:
      return "Class B"
    if x in [7.0,8.0]:
      return "Class C"
    if x in [9.0,10.0]:
      return "Class D"
    else:
      return "Other"

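  def transform(self, X):
    # Return the mapped data instead of the transformer itself;
    # fit() learns nothing, so transform and fit_transform do the same work
    return self.fit_transform(X)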

  def fit_transform(self, X, y=0):
    # TypeError: fit_transform() takes 2 positional arguments but 3 were given
    # Adjusted: https://intellipaat.com/community/2966/fittransform-takes-2-positional-arguments-but-3-were-given-with-labelbinarizer

    # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    # -> Compare df['a'].shape and X_new.shape: df['a'] is basically [3,2,3], while X_new is [[3,2,3],[2,3,4]]. Using X_new['a'] or X_new['b'] works.
    # But with both columns, it's not clear which one should be mapped -> therefore ambiguous
    X = X.copy()
    X = X.apply(lambda x : self.map_values(x))
    return X

df = pd.DataFrame({'y': [4,5,6], 'a':[3,2,3], 'b' : [2,3,4]})

e = Extractor()
e.fit_transform(df['a'])


features = ['a']
numeric_features=['b']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])
custom_transformer = Pipeline(steps=[
    ('map_value', Extractor())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('time',custom_transformer, features)])

X_new = df[['a','b']]
y_new = df['y']

# Tried pd.Series(X_new.values.flatten().tolist()), but got "tuple index out of range", because of course there are now 6 x values and only 3 y values.
X_transform = preprocessor.fit_transform(pd.Series(X_new.values.flatten().tolist()),y_new)
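
If the goal is simply to map every value of the selected column(s) to a class label, one possible sketch is to put the mapping into transform, apply it column by column, and let TransformerMixin provide fit_transform. The class name ValueMapper is hypothetical and this is only one way to do it, assuming the transformer may receive a DataFrame with one or more columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class ValueMapper(TransformerMixin, BaseEstimator):
    """Hypothetical rewrite of Extractor that accepts 2D input."""

    def fit(self, X, y=None):
        # Nothing to learn, the mapping is fixed
        return self

    def map_value(self, x):
        if x in [1.0, 2.0, 3.0]:
            return "Class A"
        if x in [4.0, 5.0, 6.0]:
            return "Class B"
        if x in [7.0, 8.0]:
            return "Class C"
        if x in [9.0, 10.0]:
            return "Class D"
        return "Other"

    def transform(self, X):
        # Work on a copy and map each column value by value
        X = pd.DataFrame(X).copy()
        for col in X.columns:
            X[col] = X[col].map(self.map_value)
        return X


df = pd.DataFrame({'y': [4, 5, 6], 'a': [3, 2, 3], 'b': [2, 3, 4]})

features = ['a']
numeric_features = ['b']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])
custom_transformer = Pipeline(steps=[
    ('map_value', ValueMapper())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('time', custom_transformer, features)])

X_new = df[['a', 'b']]
y_new = df['y']

# First column: imputed b, second column: class label derived from a
X_transform = preprocessor.fit_transform(X_new, y_new)
print(X_transform)
```

The result still contains strings, so before fitting a model you would probably want to encode the label column, for example with OneHotEncoder as a second step in custom_transformer.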

Upvotes: 1
