ThePortakal

Reputation: 239

Scikit-learn FunctionTransformer in a train-test setting

Given two arrays x_train and x_test, I want to create a custom sklearn transformer that learns, during fitting, which columns should be transformed. In particular, I want to log-transform the columns that are right-skewed (skewness > 1).

Here is an example:

import pandas as pd
x_train = pd.DataFrame([[1,5],[1,4],[1,5],[1,4],[3,6],[4,5]])
x_test = pd.DataFrame([[1,4],[2,4],[5,4],[6,4],[10,8],[7,12]])

print(x_train.skew())
print(x_test.skew())

So, ideally, the transformer should transform only column 0, and not column 1, for both x_train and x_test. I tried this:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def log_skewed(df): #function that log-transforms skewed columns
    skewness = df.skew()
    for i,s in enumerate(skewness):
        if s>1:
            df[i] = np.log(df[i])
    return df
            
transformer = FunctionTransformer(log_skewed)
transformer.fit(x_train)
x_train_new = transformer.transform(x_train)
x_test_new = transformer.transform(x_test)

The problem is that this approach transforms different columns in x_train and x_test.

How can I "teach" the transformer, during fitting, which columns will be transformed?

Upvotes: 1

Views: 581

Answers (2)

Ben Reiniger

Reputation: 12602

Another approach uses the callable option of the column specification in ColumnTransformer. At fit time, the callable checks for skewed columns, and saves the output to determine which columns to apply the log to at transform time.

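import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def skew_identifier(X):
    # called once at fit time; returns a boolean mask of the skewed columns
    return X.skew() > 1

tfmr = ColumnTransformer(
    transformers=[
        ('log', FunctionTransformer(np.log), skew_identifier)
    ],
    remainder='passthrough',
)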

As in afsharov's answer, this assumes pandas as input, but only for .skew() to work. To work on numpy arrays, you could use scipy.stats.skew instead.
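A minimal sketch of that NumPy variant, assuming SciPy is installed. The name skew_identifier_np is made up for illustration. bias=False is passed because scipy.stats.skew defaults to the biased estimator, whereas pandas' .skew() applies the bias correction.

```python
import numpy as np
from scipy.stats import skew
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def skew_identifier_np(X):
    # hypothetical helper: boolean mask of the columns with skewness > 1
    return skew(np.asarray(X, dtype=float), axis=0, bias=False) > 1


tfmr_np = ColumnTransformer(
    transformers=[("log", FunctionTransformer(np.log), skew_identifier_np)],
    remainder="passthrough",
)

# the question's data as plain NumPy arrays
x_train_np = np.array([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]], dtype=float)
x_test_np = np.array([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]], dtype=float)

tfmr_np.fit(x_train_np)                       # the mask is computed here, on x_train_np only
x_test_np_new = tfmr_np.transform(x_test_np)  # the same mask is reused here
```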

You can inspect which columns were identified as skewed via tfmr.transformers_[0][2] (the 0th transformer; index 2 is its column specification). tfmr.transformers[0][2] still contains the original callable. tfmr._columns could also be useful, but it is private and so may change.
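As a rough usage sketch on the question's data (the variable names skewed and x_test_new are only illustrative), fitting on x_train and then transforming x_test could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def skew_identifier(X):
    # evaluated at fit time only
    return X.skew() > 1


tfmr = ColumnTransformer(
    transformers=[("log", FunctionTransformer(np.log), skew_identifier)],
    remainder="passthrough",
)

x_train = pd.DataFrame([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = pd.DataFrame([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

tfmr.fit(x_train)
skewed = np.asarray(tfmr.transformers_[0][2])  # column specification computed on x_train
x_test_new = tfmr.transform(x_test)            # log is applied to column 0 only

print(skewed)
print(x_test_new)
```

Two things to keep in mind: by default the result is a NumPy array rather than a DataFrame, and ColumnTransformer puts the transformed columns first and the passthrough columns after them, so the output column order can differ from the input order when the skewed columns are not the leading ones.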

Upvotes: 1

afsharov

Reputation: 5164

You should not use FunctionTransformer for this scenario, since its fit() method merely checks the input X for the correct type and shape. In your example, you do not (and actually cannot) save the information about which columns to transform when fitting the transformer to x_train. The wrapped function is simply called again on whatever is passed to transform(), so the skewness is recomputed for each input.
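The statelessness can be checked with a small sketch. log_skewed_copy is a hypothetical, non-mutating variant of the question's log_skewed, and changed_columns is a helper invented here to report which columns were altered.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def log_skewed_copy(df):
    # hypothetical variant of log_skewed that leaves its input untouched
    out = df.astype(float)
    skewed = df.skew() > 1
    for col in skewed[skewed].index:
        out[col] = np.log(out[col])
    return out


def changed_columns(before, after):
    # labels of the columns whose values differ after the transform
    return [c for c in before.columns if not np.allclose(before[c], after[c])]


x_train = pd.DataFrame([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = pd.DataFrame([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

transformer = FunctionTransformer(log_skewed_copy)
transformer.fit(x_train)  # nothing about the skewed columns is stored here

train_changed = changed_columns(x_train, transformer.transform(x_train))
test_changed = changed_columns(x_test, transformer.transform(x_test))
print(train_changed, test_changed)
```

With the question's data, train_changed is [0] while test_changed is [1]: each call decides from its own input, regardless of what fit() saw.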

You have to define a custom transformer that does what you want. During fit, it should learn which columns are skewed; during transform, it should log-transform exactly those columns. A solution could look like this:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class SkewnessTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        self._columns = None

    @property
    def columns_(self):
        if self._columns is None:
            raise Exception('SkewnessTransformer has not been fitted yet')
        return self._columns

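    def fit(self, X, y=None):
        # remember the labels of the columns that are right-skewed in the training data
        skewness = X.skew()
        self._columns = skewness.loc[skewness > 1].index.values
        return self

    def transform(self, X, y=None):
        # work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        X[self.columns_] = np.log(X[self.columns_])
        return X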

For your data, you can then use it like this:

import pandas as pd

x_train = pd.DataFrame([[1,5],[1,4],[1,5],[1,4],[3,6],[4,5]])
x_test = pd.DataFrame([[1,4],[2,4],[5,4],[6,4],[10,8],[7,12]])

skt = SkewnessTransformer()

print(skt.fit_transform(x_train))
# output
          0  1
0  0.000000  5
1  0.000000  4
2  0.000000  5
3  0.000000  4
4  1.098612  6
5  1.386294  5

print(skt.transform(x_test))
# output
          0   1
0  0.000000   4
1  0.693147   4
2  1.609438   4
3  1.791759   4
4  2.302585   8
5  1.945910  12

print(skt.columns_)
# output: [0]

There are, however, some constraints to this solution:

  • the transformer expects a pandas DataFrame as input
  • it expects the column names in the index of the output of skew()
  • once fitted, the inputs must have the same column names
  • the transformer only checks for skewness > 1 (this is simply the threshold from your example)

If any of these are undesirable, the solution needs to be modified accordingly.
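As one possible modification, here is a sketch that accepts NumPy arrays (or anything check_array can convert) and exposes the threshold as a parameter. The names ArraySkewnessTransformer, threshold and skewed_mask_ are assumptions for illustration, and SciPy is assumed to be available; it is not a drop-in replacement for the class above.

```python
import numpy as np
from scipy.stats import skew
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class ArraySkewnessTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical array-based variant: log-transform columns skewed at fit time."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = check_array(X, dtype=float)
        # bias=False matches the bias-corrected estimate that pandas' .skew() uses
        self.skewed_mask_ = skew(X, axis=0, bias=False) > self.threshold
        return self

    def transform(self, X):
        check_is_fitted(self, "skewed_mask_")
        X = check_array(X, dtype=float, copy=True)  # never modify the caller's data
        X[:, self.skewed_mask_] = np.log(X[:, self.skewed_mask_])
        return X


x_train_arr = np.array([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test_arr = np.array([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

skt_arr = ArraySkewnessTransformer().fit(x_train_arr)
x_test_arr_new = skt_arr.transform(x_test_arr)
print(skt_arr.skewed_mask_)
```

Storing a positional boolean mask instead of column labels removes the dependence on column names, at the cost of requiring the same column order at fit and transform time.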

Upvotes: 3
