Reputation: 85
I am building a machine learning pipeline. I have a custom function that changes the values of a specific column. I wrapped it in a custom transformer, which works fine on its own, but when I call it from a pipeline it throws an error.
Sample Dataframe
df = pd.DataFrame({'y': [4,5,6], 'a':[3,2,3], 'b' : [2,3,4]})
import numpy as np
import pandas as pd
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
class Extractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        return None

    def fit(self, x, y=None):
        return self

    def map_values(self, x):
        if x in [1.0,2.0,3.0]:
            return "Class A"
        if x in [4.0,5.0,6.0]:
            return "Class B"
        if x in [7.0,8.0]:
            return "Class C"
        if x in [9.0,10.0]:
            return "Class D"
        else:
            return "Other"

    def transform(self, X):
        return self

    def fit_transform(self, X):
        X = X.copy()
        X = X.apply(lambda x : self.map_values(x))
        return X
e = Extractor()
e.fit_transform(df['a'])
0    Class A
1    Class C
2      Other
3    Class B
Name: a, dtype: object
Pipeline
features = ['a']
numeric_features=['b']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])
custom_transformer = Pipeline(steps=[
    ('map_value', Extractor())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('time', custom_transformer, features)])
X_new = df[['a','b']]
y_new = df['y']
X_transform = preprocessor.fit_transform(X_new,y_new)
TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'Pipeline(steps=[('map_value', Extractor())])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't.
I want to make the custom transformer work in the pipeline.
Upvotes: 0
Views: 2719
Reputation: 2478
I tried working with your code and found some issues. Below is the updated code with some remarks.
First of all, after copy-pasting your code and adding the missing import for SimpleImputer, I could not reproduce your error. Instead, it raised: "TypeError: fit_transform() takes 2 positional arguments but 3 were given". After some research, I found a fix (linked in the code comments below) and adjusted your method accordingly.
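A minimal sketch of why that TypeError appears, assuming the caller passes both X and y to fit_transform, as the error message suggests. The class names NarrowTransformer and DefaultTransformer are hypothetical and only serve the illustration. A fit_transform that accepts only X fails when y is also passed, whereas the version inherited from TransformerMixin accepts an optional y.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class NarrowTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer whose fit_transform does not accept y."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def fit_transform(self, X):  # no y parameter
        return X


class DefaultTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that relies on the inherited fit_transform."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


X = pd.DataFrame({'a': [3, 2, 3]})
y = pd.Series([4, 5, 6])

# Passing y to a fit_transform that only takes X raises a TypeError.
try:
    NarrowTransformer().fit_transform(X, y)
    narrow_error = None
except TypeError as exc:
    narrow_error = str(exc)

# The inherited fit_transform accepts y and calls fit(X, y).transform(X).
default_result = DefaultTransformer().fit_transform(X, y)
```

Adding a y parameter with a default value to a custom fit_transform, or dropping the override entirely, avoids the mismatch.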
With that change, it raised a different error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
The problem is that your Extractor expects a pandas Series in which each entry is a number that can be mapped to one of your classes. In other words, the input must be one-dimensional, like a list. This works well with df['a'], which is basically [3,2,3].
But when you use df[['a','b']], you pass two columns, that is, two lists: [3,2,3] for a and [2,3,4] for b.
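The difference can be reproduced without scikit-learn. The sketch below uses a shortened, hypothetical map_values with only one class. Series.apply calls the function once per element, while DataFrame.apply calls it once per column and hands it a whole Series, so the `x in [...]` test has to turn a Series into a single True/False and fails. Mapping each column element-wise is one way around this.

```python
import pandas as pd


def map_values(x):
    # Shortened, hypothetical version of the mapping in the question.
    return "Class A" if x in [1.0, 2.0, 3.0] else "Other"


df = pd.DataFrame({'a': [3, 2, 3], 'b': [2, 3, 4]})

# Series.apply: map_values receives one number at a time.
series_result = df['a'].apply(map_values)

# DataFrame.apply: map_values receives each column as a Series,
# and `x in [...]` cannot reduce a Series to a single boolean.
try:
    df[['a', 'b']].apply(map_values)
    frame_error = None
except ValueError as exc:
    frame_error = str(exc)

# Element-wise alternative: map every column value by value.
elementwise_result = df[['a', 'b']].apply(lambda col: col.map(map_values))
```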
So you need to decide what you actually want your Extractor to do. My first thought was to put a and b into a single list, [3,2,3,2,3,4], but then you end up with six class labels, which does not match your three y entries.
Therefore, I believe you want to implement a method that takes a list of classes and perhaps picks the most represented class, or something similar.
For example, you need to map a[0] and b[0] to y[0], so Class A & Class A = 4 (to match y[0]).
import numpy as np
import pandas as pd
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Added import
from sklearn.impute import SimpleImputer
class Extractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        return None

    def fit(self, x, y=None):
        return self

    def map_values(self, x):
        if x in [1.0,2.0,3.0]:
            return "Class A"
        if x in [4.0,5.0,6.0]:
            return "Class B"
        if x in [7.0,8.0]:
            return "Class C"
        if x in [9.0,10.0]:
            return "Class D"
        else:
            return "Other"

    def transform(self, X):
        return self

    def fit_transform(self, X, y=0):
        # Fixes "TypeError: fit_transform() takes 2 positional arguments but 3 were given".
        # Adjusted per: https://intellipaat.com/community/2966/fittransform-takes-2-positional-arguments-but-3-were-given-with-labelbinarizer
        # Now raises "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
        # -> Compare df['a'].shape and X_new.shape. df['a'] is basically [3,2,3] and X_new is [[3,2,3],[2,3,4]]. Using X_new['a'] or X_new['b'] works.
        # But with both columns, it's not clear which one should be mapped -> therefore ambiguous.
        X = X.copy()
        X = X.apply(lambda x : self.map_values(x))
        return X
df = pd.DataFrame({'y': [4,5,6], 'a':[3,2,3], 'b' : [2,3,4]})
e = Extractor()
e.fit_transform(df['a'])
features = ['a']
numeric_features=['b']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])
custom_transformer = Pipeline(steps=[
    ('map_value', Extractor())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('time', custom_transformer, features)])
X_new = df[['a','b']]
y_new = df['y']
# Tried pd.Series(X_new.values.flatten().tolist()), but it fails with "tuple index out of range", because there are now 6 x values and only 3 y values.
X_transform = preprocessor.fit_transform(pd.Series(X_new.values.flatten().tolist()),y_new)
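One possible way to get the pipeline running is sketched below. It assumes the goal is only to map each value of column a to its class label while b is imputed separately; the question does not state this, so treat it as an assumption. The mapping moves into transform, fit_transform is inherited from TransformerMixin, and values are mapped element-wise so that a one-column DataFrame works as input. The pd.DataFrame(X) conversion and the static map_values are additions that are not in the original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class Extractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; the mapping is fixed.
        return self

    @staticmethod
    def map_values(x):
        if x in (1.0, 2.0, 3.0):
            return "Class A"
        if x in (4.0, 5.0, 6.0):
            return "Class B"
        if x in (7.0, 8.0):
            return "Class C"
        if x in (9.0, 10.0):
            return "Class D"
        return "Other"

    def transform(self, X):
        # Assumption: accept a Series, DataFrame, or 2-D array and
        # map every value individually, column by column.
        X = pd.DataFrame(X)
        return X.apply(lambda col: col.map(self.map_values))


df = pd.DataFrame({'y': [4, 5, 6], 'a': [3, 2, 3], 'b': [2, 3, 4]})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))])
custom_transformer = Pipeline(steps=[
    ('map_value', Extractor())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['b']),
        ('time', custom_transformer, ['a'])])

X_new = df[['a', 'b']]
y_new = df['y']
X_transform = preprocessor.fit_transform(X_new, y_new)
```

With default settings the result should be an array with one row per sample, the imputed b values in the first column and the class labels for a in the second.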
Upvotes: 1