fghoussen

Reputation: 565

Handling data and target scaling with both Pipeline and TransformedTargetRegressor on nested regressors

I'd like to use both Pipeline and TransformedTargetRegressor to handle all the scaling (on data and target) on BaggingRegressor and all of its estimators.

My first attempt, which uses neither Pipeline nor TransformedTargetRegressor, works fine:

$ cat test1.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

def f(x):
    return x*np.cos(x) + np.random.normal(size=500)*2

def main():
    # Generate random data.
    x = np.linspace(0, 10, 500)
    rng = np.random.RandomState(0)
    rng.shuffle(x)
    x = np.sort(x[:])
    y = f(x)

    # Plot random data.
    fig, axis = plt.subplots(1, 1, figsize=(20, 10))
    axis.plot(x, y, 'o', color='black', markersize=2, label='random data')

    # Create bagging models.
    model = BaggingRegressor(n_estimators=5, base_estimator=SVR())
    x_augmented = np.array([x, x**2, x**3, x**4, x**5]).T
    model.fit(x_augmented, y)

    # Plot intermediate regression estimations.
    axis.plot(x, model.predict(x_augmented), '-', color='red', label=model.__class__.__name__)
    for i, tree in enumerate(model.estimators_):
        y_pred = tree.predict(x_augmented)
        axis.plot(x, y_pred, '--', label='tree '+str(i))

    axis.axis('off')
    axis.legend()
    plt.show()

if __name__ == '__main__':
    main()

This works as expected: the bagging regressor's curve is superimposed on the curves of its estimators.

Now I want to use Pipeline and TransformedTargetRegressor to handle all the scaling of the data and the target, but it doesn't work: the bagging regressor's curve is on a different scale from the estimators' curves:

$ cat test2.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor

def f(x):
    return x*np.cos(x) + np.random.normal(size=500)*2

def main():
    # Generate random data.
    x = np.linspace(0, 10, 500)
    rng = np.random.RandomState(0)
    rng.shuffle(x)
    x = np.sort(x[:])
    y = f(x)

    # Plot random data.
    fig, axis = plt.subplots(1, 1, figsize=(20, 10))
    axis.plot(x, y, 'o', color='black', markersize=2, label='random data')

    # Create bagging models.
    model = BaggingRegressor(n_estimators=5, base_estimator=SVR())
    x_augmented = np.array([x, x**2, x**3, x**4, x**5]).T
    pipe = Pipeline([('scale', preprocessing.StandardScaler()), ('model', model)])
    treg = TransformedTargetRegressor(regressor=pipe, transformer=preprocessing.MinMaxScaler())
    treg.fit(x_augmented, y)

    # Plot intermediate regression estimations.
    axis.plot(x, treg.predict(x_augmented), '-', color='red', label=model.__class__.__name__)
    for i, tree in enumerate(treg.regressor_['model'].estimators_):
        y_hat = tree.predict(x_augmented)
        y_transformer = preprocessing.MinMaxScaler().fit(y.reshape(-1, 1))
        y_pred = y_transformer.inverse_transform(y_hat.reshape(-1, 1))
        axis.plot(x, y_pred, '--', label='tree '+str(i))
    axis.axis('off')
    axis.legend()
    plt.show()

if __name__ == '__main__':
    main()

How do I handle scaling properly for the bagging regressor and all of its nested estimators?

The only difference between the two tests is the use of Pipeline and TransformedTargetRegressor:

$ diff test1.py  test2.py
7a8,10
> from sklearn import preprocessing
> from sklearn.pipeline import Pipeline
> from sklearn.compose import TransformedTargetRegressor
27c30,32
<     model.fit(x_augmented, y)
---
>     pipe = Pipeline([('scale', preprocessing.StandardScaler()), ('model', model)])
>     treg = TransformedTargetRegressor(regressor=pipe, transformer=preprocessing.MinMaxScaler())
>     treg.fit(x_augmented, y)
30,32c35,39
<     axis.plot(x, model.predict(x_augmented), '-', color='red', label=model.__class__.__name__)
<     for i, tree in enumerate(model.estimators_):
<         y_pred = tree.predict(x_augmented)
---
>     axis.plot(x, treg.predict(x_augmented), '-', color='red', label=model.__class__.__name__)
>     for i, tree in enumerate(treg.regressor_['model'].estimators_):
>         y_hat = tree.predict(x_augmented)
>         y_transformer = preprocessing.MinMaxScaler().fit(y.reshape(-1, 1))
>         y_pred = y_transformer.inverse_transform(y_hat.reshape(-1, 1))

EDIT

I tried using the transformer_ attribute of the TransformedTargetRegressor instance: test3 (identical to test2 except for the following diff) fails too!

$ diff test2.py test3.py
38c38
<         y_transformer = preprocessing.MinMaxScaler().fit(y.reshape(-1, 1))
---
>         y_transformer = treg.transformer_

Upvotes: 0

Views: 205

Answers (1)

glemaitre

Reputation: 1003

I don't think the problem is in your model code; it is in the plotting part:

    # Plot intermediate regression estimations.
    axis.plot(x, treg.predict(x_augmented), '-', color='red', label=model.__class__.__name__)
    for i, tree in enumerate(treg.regressor_['model'].estimators_):
        y_hat = tree.predict(x_augmented)
        y_transformer = preprocessing.MinMaxScaler().fit(y.reshape(-1, 1))
        y_pred = y_transformer.inverse_transform(y_hat.reshape(-1, 1))
        axis.plot(x, y_pred, '--', label='tree '+str(i))

Here, tree is an SVR() that was fitted inside the pipeline, so it was trained on x_augmented after scaling by the StandardScaler. In the plotting loop you call tree.predict on the unscaled x_augmented, so the predictions do not match what you expect.

Changing the plotting code to the following snippet fixes it:

# Plot intermediate regression estimations.
axis.plot(x, treg.predict(x_augmented), '-', color='red', label=model.__class__.__name__)
# Apply the pipeline's fitted StandardScaler once: the nested estimators were trained on scaled inputs.
x_augmented_scaled = treg.regressor_.named_steps['scale'].transform(x_augmented)
# Reuse the target scaler fitted by TransformedTargetRegressor instead of refitting a new one.
y_transformer = treg.transformer_
for i, tree in enumerate(treg.regressor_['model'].estimators_):
    y_hat = tree.predict(x_augmented_scaled)
    y_pred = y_transformer.inverse_transform(y_hat.reshape(-1, 1))
    axis.plot(x, y_pred, '--', label='tree '+str(i))
axis.axis('off')
axis.legend()
plt.show()
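
To check this without looking at a plot, here is a minimal sketch that replays both scaling steps by hand and compares the average of the per-estimator predictions with treg.predict. The helpers make_data and per_estimator_predictions are hypothetical names introduced only for this illustration, and the noise is seeded so the run is repeatable (the question's f(x) is unseeded). SVR() is passed positionally because the keyword was renamed from base_estimator to estimator in newer scikit-learn releases.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVR


def make_data(n=500, seed=0):
    """Hypothetical stand-in for the question's data: x*cos(x) plus seeded noise."""
    rng = np.random.RandomState(seed)
    x = np.linspace(0, 10, n)
    y = x * np.cos(x) + 2 * rng.normal(size=n)
    X = np.array([x, x**2, x**3, x**4, x**5]).T
    return X, y


def per_estimator_predictions(treg, X):
    """Predict with each bagged estimator, expressed in the original target units."""
    pipe = treg.regressor_                  # fitted Pipeline
    bagging = pipe.named_steps['model']     # fitted BaggingRegressor
    # Replay the pipeline's input scaling once, outside the loop.
    X_scaled = pipe.named_steps['scale'].transform(X)
    preds = []
    for est, feats in zip(bagging.estimators_, bagging.estimators_features_):
        # Each estimator only saw its own feature subset (all columns with default settings).
        y_hat = est.predict(X_scaled[:, feats])
        # Undo the target scaling with the transformer fitted during treg.fit.
        y_pred = treg.transformer_.inverse_transform(y_hat.reshape(-1, 1))
        preds.append(y_pred.ravel())
    return np.array(preds)


X, y = make_data()
bagging = BaggingRegressor(SVR(), n_estimators=5, random_state=0)
pipe = Pipeline([('scale', StandardScaler()), ('model', bagging)])
treg = TransformedTargetRegressor(regressor=pipe, transformer=MinMaxScaler())
treg.fit(X, y)

preds = per_estimator_predictions(treg, X)
# MinMaxScaler's inverse is affine, so averaging after the inverse transform
# should agree with the ensemble prediction up to floating-point error.
print(np.allclose(preds.mean(axis=0), treg.predict(X)))
```

If the two agree, the per-estimator curves are on the same scale as the bagging curve, which is what the plot in test1 showed.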

Upvotes: 1
