user3768495

Reputation: 4647

How to integrate a pandas operation into an sklearn pipeline

I have a simple operation on a pandas DataFrame like this:

import pandas as pd

# initialization
dct = {1: 'A', 2: 'B', 3: 'C'}
df = pd.DataFrame({'id': [1, 2, 3], 'value': [7, 8, 9]})
# actual transformation: look up each id in dct and store the result in a new column
df['newid'] = df.id.map(dct)

I would like to make this transformation part of an sklearn pipeline. I found a few tutorials here, here, and here, but I just can't get it to work. Here is one of the many versions I have tried:

# initialization
dct = {1: 'A', 2:'B', 3: 'C'}
df = pd.DataFrame({'id': [1,2,3], 'value':[7,8,9]})

# define a class similar to those in the tutorials
class idMapper(BaseEstimator, TransformerMixin):
    def __init__(self, key='id'):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[key].map(dct)

# Apply the transformation
idMapper.fit_transform(df)

The error message is: TypeError: fit_transform() missing 1 required positional argument: 'X'. Can anyone help me fix this issue and get it working? Thanks!

Upvotes: 0

Views: 415

Answers (1)

Jan K

Reputation: 4150

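Below is a corrected version of your code; the changes are marked in the comments. Two things were wrong: transform referenced a bare key instead of self.key, and fit_transform was called on the class itself rather than on an instance. In the second case Python bound df to self, which is why it reported X as missing.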

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

dct = {1: 'A', 2: 'B', 3: 'C'}
df = pd.DataFrame({'id': [1, 2, 3], 'value': [7, 8, 9]})

# define a class similar to those in the tutorials
class idMapper(BaseEstimator, TransformerMixin):
    def __init__(self, key='id'):
        self.key = key

    def fit(self, X, y=None):
        # nothing to learn from the data, so just return the transformer
        return self

    def transform(self, X):
        return X[self.key].map(dct)  # <--- self.key, not the bare name key

# Apply the transformation
idMapper().fit_transform(df)  # <--- need to instantiate the class first
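
To go one step further and actually place the transformer in a pipeline, something along these lines may work. This is only a sketch under a few assumptions that are not in the question: the class name IdMapper, the mapping and new_key constructor parameters, and the step name 'map_ids' are all made up for illustration, and the transformer returns the whole DataFrame with the new column appended (rather than just the mapped Series) so that later steps still see the other columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class IdMapper(BaseEstimator, TransformerMixin):
    """Hypothetical variant: adds a mapped column and returns the DataFrame."""

    def __init__(self, mapping=None, key='id', new_key='newid'):
        # Store the arguments unchanged so get_params/clone keep working.
        self.mapping = mapping    # assumed parameter: dict used for the lookup
        self.key = key            # column to read from
        self.new_key = new_key    # assumed parameter: column to write to

    def fit(self, X, y=None):
        # Nothing is learned; the mapping is supplied up front.
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X[self.new_key] = X[self.key].map(self.mapping)
        return X


dct = {1: 'A', 2: 'B', 3: 'C'}
df = pd.DataFrame({'id': [1, 2, 3], 'value': [7, 8, 9]})

# A one-step pipeline; further transformers or an estimator could follow.
pipe = Pipeline([('map_ids', IdMapper(mapping=dct))])
result = pipe.fit_transform(df)
```

Passing the dictionary through the constructor instead of reading a global dct keeps the transformer self-contained, which tends to matter once the pipeline is cloned or pickled.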

Upvotes: 3
