Reputation: 239
Given two arrays x_train and x_test, I want to create a custom sklearn transformer that learns, during fitting, which columns should be transformed. In particular, I want to log-transform the columns that are right-skewed (skewness > 1).
Here is an example:
import pandas as pd
x_train = pd.DataFrame([[1,5],[1,4],[1,5],[1,4],[3,6],[4,5]])
x_test = pd.DataFrame([[1,4],[2,4],[5,4],[6,4],[10,8],[7,12]])
print(x_train.skew())
print(x_test.skew())
x_train[0] <- skewed
x_train[1] <- not skewed
x_test[0] <- not skewed
x_test[1] <- skewed
So, ideally, the transformer should transform only column 0, not column 1, for both x_train and x_test. I tried this:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
def log_skewed(df):  # function that log-transforms skewed columns
    skewness = df.skew()
    for i, s in enumerate(skewness):
        if s > 1:
            df[i] = np.log(df[i])
    return df
transformer = FunctionTransformer(log_skewed)
transformer.fit(x_train)
x_train_new = transformer.transform(x_train)
x_test_new = transformer.transform(x_test)
This transforms different columns in x_train and x_test, because the skewness is recomputed on whatever data is passed to transform.
How can I "teach" the transformer, during fitting, which columns will be transformed?
Upvotes: 1
Views: 581
Reputation: 12602
Another approach uses the callable option of the column specification in ColumnTransformer. At fit time, the callable checks for skewed columns, and the result is saved to determine which columns to apply the log to at transform time.
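import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def skew_identifier(X):
    # boolean mask: True for columns whose skewness exceeds 1
    return X.skew() > 1

tfmr = ColumnTransformer(
    transformers=[
        ('log', FunctionTransformer(np.log), skew_identifier)
    ],
    remainder='passthrough',
)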
As in afsharov's answer, this assumes pandas input, but only so that .skew() works. To work on numpy arrays, you could use scipy.stats.skew instead.
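A minimal sketch of that numpy variant, reusing the arrays from the question. The helper name skew_mask and the variable names are made up for illustration. One assumption worth flagging: scipy.stats.skew defaults to the uncorrected estimate (bias=True), while pandas' .skew() applies a bias correction, so bias=False is passed here to keep the two comparable. On this small sample the uncorrected estimate for column 0 of x_train falls below the threshold of 1.

```python
import numpy as np
from scipy.stats import skew
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def skew_mask(X):
    # Hypothetical helper: boolean mask of columns whose skewness exceeds 1.
    # bias=False requests the bias-corrected estimate, as pandas' .skew() does.
    return skew(np.asarray(X, dtype=float), axis=0, bias=False) > 1

x_train = np.array([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = np.array([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

tfmr_np = ColumnTransformer(
    transformers=[("log", FunctionTransformer(np.log), skew_mask)],
    remainder="passthrough",
)

# The mask is evaluated once, on the data passed to fit
train_out = tfmr_np.fit_transform(x_train)
# transform reuses the columns chosen at fit time
test_out = tfmr_np.transform(x_test)
```

Note that ColumnTransformer places the transformed columns first and the passthrough columns after them, so the column order of the output can differ from the input when the skewed columns are not the leading ones.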
You can inspect which columns were identified as skewed via tfmr.transformers_[0][2] (index 0 selects the first transformer, index 2 its column specification). tfmr.transformers[0][2] contains the original callable. tfmr._columns could also be useful, but it is private and so may change.
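For illustration, here is a sketch of fitting on the question's x_train and then looking at both attributes. The variable names fitted_columns and x_test_new are placeholders, and the exact form of the stored column specification (boolean mask or integer indices) may depend on the scikit-learn version, so no output is shown for it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def skew_identifier(X):
    # boolean mask: True for columns whose skewness exceeds 1
    return X.skew() > 1

x_train = pd.DataFrame([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = pd.DataFrame([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

tfmr = ColumnTransformer(
    transformers=[("log", FunctionTransformer(np.log), skew_identifier)],
    remainder="passthrough",
)
tfmr.fit(x_train)

# Fitted attribute: the callable has been replaced by the columns it selected
name, fitted_step, fitted_columns = tfmr.transformers_[0]
print(fitted_columns)

# Constructor parameter: still holds the original callable
print(tfmr.transformers[0][2] is skew_identifier)

# Only the column chosen from x_train is logged, even though x_test is skewed elsewhere
x_test_new = tfmr.transform(x_test)
```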
Upvotes: 1
Reputation: 5164
You should not use FunctionTransformer for this scenario, since its fit() method merely checks the input X for the correct type and shape. In your example, you do not (and actually cannot) save the information about which columns to transform when fitting the transformer to x_train.
You have to define a custom transformer that does what you want. During fit, it should learn which columns are skewed, and when transform is called, it should transform the corresponding columns. A solution could look like this:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd

class SkewnessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._columns = None

    @property
    def columns_(self):
        if self._columns is None:
            raise Exception('SkewnessTransformer has not been fitted yet')
        return self._columns

    def fit(self, X, y=None):
        # learn which columns are right-skewed in the training data
        skewness = X.skew()
        self._columns = skewness.loc[skewness > 1].index.values
        return self

    def transform(self, X, y=None):
        X = X.copy()  # avoid modifying the caller's DataFrame in place
        # log-transform only the columns identified during fit
        X[self.columns_] = np.log(X[self.columns_])
        return X
For your data, you can then use it like this:
x_train = pd.DataFrame([[1,5],[1,4],[1,5],[1,4],[3,6],[4,5]])
x_test = pd.DataFrame([[1,4],[2,4],[5,4],[6,4],[10,8],[7,12]])
skt = SkewnessTransformer()
print(skt.fit_transform(x_train))
# output
          0  1
0  0.000000  5
1  0.000000  4
2  0.000000  5
3  0.000000  4
4  1.098612  6
5  1.386294  5
print(skt.transform(x_test))
# output
          0   1
0  0.000000   4
1  0.693147   4
2  1.609438   4
3  1.791759   4
4  2.302585   8
5  1.945910  12
print(skt.columns_)
# output: [0]
There are, however, some constraints to this solution:
- it expects a DataFrame as input
- it relies on the pandas skew() method
If any of these are undesirable, the solution needs to be modified accordingly.
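One possible modification, sketched under some assumptions: the class name ArraySkewnessTransformer and its threshold parameter are invented for this example, skewness is computed with scipy.stats.skew (with bias=False to stay close to the pandas estimate), and the input is converted to a float numpy array, so DataFrame column labels are not preserved in the output.

```python
import numpy as np
from scipy.stats import skew
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class ArraySkewnessTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant that accepts numpy arrays (or anything array-like)."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # positions of the columns whose training skewness exceeds the threshold
        self.columns_ = np.flatnonzero(skew(X, axis=0, bias=False) > self.threshold)
        return self

    def transform(self, X):
        check_is_fitted(self, "columns_")
        X = np.array(X, dtype=float)  # np.array copies, so the input is left untouched
        X[:, self.columns_] = np.log(X[:, self.columns_])
        return X

x_train = np.array([[1, 5], [1, 4], [1, 5], [1, 4], [3, 6], [4, 5]])
x_test = np.array([[1, 4], [2, 4], [5, 4], [6, 4], [10, 8], [7, 12]])

ast = ArraySkewnessTransformer()
train_out = ast.fit_transform(x_train)
test_out = ast.transform(x_test)
print(ast.columns_)
```

Storing the learned state in a trailing-underscore attribute set during fit lets check_is_fitted raise scikit-learn's usual not-fitted error, and keeping threshold as a plain constructor argument keeps the estimator compatible with get_params and cloning.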
Upvotes: 3