Scikit-learn FunctionTransformer in a train-test setting

Question

Given two arrays x_train and x_test, I want to create a custom sklearn transformer that learns, during fitting, which columns should be transformed. In particular, I want to log-transform the columns that are right-skewed (skewness > 1).

Here is an example:

import pandas as pd
x_train = pd.DataFrame([[1,5],[1,4],[1,5],[1,4],[3,6],[4,5]])
x_test = pd.DataFrame([[1,4],[2,4],[5,4],[6,4],[10,8],[7,12]])

print(x_train.skew())
print(x_test.skew())

x_train[0] <- skewed
x_train[1] <- not skewed
x_test[0] <- not skewed
x_test[1] <- skewed

So, ideally, the transformer should transform only column 0, and not column 1, for both x_train and x_test. I tried this:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def log_skewed(df): #function that log-transforms skewed columns
    skewness = df.skew()
    for i,s in enumerate(skewness):
        if s>1:
            df[i] = np.log(df[i])
    return df
            
transformer = FunctionTransformer(log_skewed)
transformer.fit(x_train)
x_train_new = transformer.transform(x_train)
x_test_new = transformer.transform(x_test)

The problem is that this transforms different columns of x_train and x_test.

How can I "teach" the transformer, during fitting, which columns will be transformed?

afsharov · Accepted Answer

You should not use FunctionTransformer for this scenario, since its fit() method merely checks the input X for the correct type and shape. In your example, you do not (and actually cannot) save the information about which columns to transform when fitting the transformer to x_train.
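To make the statelessness concrete, here is a minimal sketch using the data from the question. The helper `skewed_columns` and this non-mutating variant of `log_skewed` are my own illustrative names, not part of scikit-learn. The function is simply re-run on whatever data `transform` receives, so the column choice is recomputed every time:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def skewed_columns(df):
    """Return the labels of columns whose sample skewness exceeds 1."""
    skewness = df.skew()
    return list(skewness[skewness > 1].index)


def log_skewed(df):
    """Log-transform the columns that are skewed in *this* DataFrame."""
    out = df.astype(float)  # astype returns a copy, so the input is untouched
    cols = skewed_columns(df)
    if cols:
        out[cols] = np.log(out[cols])
    return out


x_train = pd.DataFrame([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = pd.DataFrame([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

# fit() does not call log_skewed, so nothing about x_train is remembered
transformer = FunctionTransformer(log_skewed).fit(x_train)

train_cols = skewed_columns(x_train)  # column 0 is skewed in x_train
test_cols = skewed_columns(x_test)    # column 1 is skewed in x_test

# transform() recomputes skewness on x_test and therefore logs column 1
test_out = transformer.transform(x_test)
```

Even though the transformer was fitted on x_train, column 1 of x_test is the one that gets log-transformed, which is exactly the mismatch described in the question.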

You have to define a custom transformer that does what you want. During fit, it should learn which columns are skewed. And when calling transform, it should transform the corresponding columns. A solution could look like this:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class SkewnessTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        self._columns = None

    @property
    def columns_(self):
        if self._columns is None:
            raise Exception('SkewnessTransformer has not been fitted yet')
        return self._columns

    def fit(self, X, y=None):
        skewness = X.skew()
        self._columns = skewness.loc[skewness > 1].index.values
        return self

    def transform(self, X, y=None):
        X = X.copy()  # work on a copy so the caller's DataFrame is not modified
        X[self.columns_] = np.log(X[self.columns_])
        return X

For your data, you can then use it like this:

import pandas as pd

x_train = pd.DataFrame([[1,5],[1,4],[1,5],[1,4],[3,6],[4,5]])
x_test = pd.DataFrame([[1,4],[2,4],[5,4],[6,4],[10,8],[7,12]])

skt = SkewnessTransformer()

print(skt.fit_transform(x_train))
# output
          0  1
0  0.000000  5
1  0.000000  4
2  0.000000  5
3  0.000000  4
4  1.098612  6
5  1.386294  5

print(skt.transform(x_test))
# output
          0   1
0  0.000000   4
1  0.693147   4
2  1.609438   4
3  1.791759   4
4  2.302585   8
5  1.945910  12

print(skt.columns_)
# output: [0]

There are, however, some constraints to this solution:

the transformer expects a pandas DataFrame as input
it expects the column names in the index of the output of skew()
once fitted, the inputs must have the same column names
the transformer only checks for skewness > 1 (but this was according to your example)

If any of these are undesirable, the solution needs to be modified accordingly.
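As one possible modification, the sketch below relaxes some of these constraints. It is only an illustration under my own assumptions: the class name `SkewedLogTransformer`, the `threshold` parameter and the `skewed_idx_` attribute are hypothetical names, and I chose to remember column positions instead of labels and to always return a NumPy array. It still computes skewness with pandas so the values match `DataFrame.skew()`:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class SkewedLogTransformer(TransformerMixin, BaseEstimator):
    """Log-transform the columns that were right-skewed in the fitting data."""

    def __init__(self, threshold=1.0):
        # Hyperparameter only; learned state is set in fit()
        self.threshold = threshold

    def fit(self, X, y=None):
        # Wrapping in a DataFrame lets both arrays and DataFrames use pandas' skew()
        skewness = pd.DataFrame(X).skew().to_numpy()
        # Remember column positions, so column labels do not matter later
        self.skewed_idx_ = np.flatnonzero(skewness > self.threshold)
        return self

    def transform(self, X):
        check_is_fitted(self, "skewed_idx_")
        out = np.array(X, dtype=float)  # float copy; the input is left unchanged
        out[:, self.skewed_idx_] = np.log(out[:, self.skewed_idx_])
        return out


x_train = pd.DataFrame([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = pd.DataFrame([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

skt = SkewedLogTransformer(threshold=1.0).fit(x_train)
train_out = skt.transform(x_train)
test_out = skt.transform(x_test.to_numpy())  # plain arrays work as well
```

The trade-off is that column labels are lost in the output, and the inputs to transform must keep the same column order as the data used for fitting.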
