different scores when using scikit-learn pipeline vs. doing it manually

Question

A simple example below using MinMaxScaler, PolynomialFeatures, and a LinearRegression model.

Doing it via a pipeline:

pipeLine = make_pipeline(MinMaxScaler(),PolynomialFeatures(), LinearRegression())

pipeLine.fit(X_train,Y_train)
print(pipeLine.score(X_test,Y_test))
print(pipeLine.steps[2][1].intercept_)
print(pipeLine.steps[2][1].coef_)

0.4433729905419167
3.4067909278765605
[ 0.         -7.60868833  5.87162697]

Doing it manually:

X_trainScaled = MinMaxScaler().fit_transform(X_train)
X_trainScaledandPoly = PolynomialFeatures().fit_transform(X_trainScaled)

X_testScaled = MinMaxScaler().fit_transform(X_test)
X_testScaledandPoly = PolynomialFeatures().fit_transform(X_testScaled)

reg = LinearRegression()
reg.fit(X_trainScaledandPoly,Y_train)
print(reg.score(X_testScaledandPoly,Y_test))
print(reg.intercept_)
print(reg.coef_)
print(reg.intercept_ == pipeLine.steps[2][1].intercept_)
print(reg.coef_ == pipeLine.steps[2][1].coef_)

0.44099256691782807
3.4067909278765605
[ 0.         -7.60868833  5.87162697]
True
[ True  True  True]

hellpanderr · Accepted Answer

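The problem lies in your manual steps: you refit the scaler on the test data. Fit the scaler on the training data only, then use that fitted instance to transform the test data. The pipeline does exactly this when scoring (it calls transform, not fit_transform, on each preprocessing step), which is why the coefficients and intercept match while the test scores differ. See here for details: How to normalize the Train and Test data using MinMaxScaler sklearn and StandardScaler before and after splitting data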
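To see why refitting changes the result, here is a minimal sketch with made-up numbers (the toy arrays are hypothetical, not the asker's data). With the default feature range, MinMaxScaler maps each value to (x - min) / (max - min), so the output depends on which data set supplied the min and max:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the test set covers a narrower range than the training set.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.0], [4.0]])

# Fit on the training data only, then reuse the learned min/max on the test data.
scaler = MinMaxScaler().fit(X_train)
test_scaled_ok = scaler.transform(X_test)

# Refitting on the test data learns a different min/max (2 and 4 instead of 0 and 10).
test_scaled_refit = MinMaxScaler().fit_transform(X_test)

print(test_scaled_ok.ravel())     # [0.2 0.4]
print(test_scaled_refit.ravel())  # [0. 1.]
```

The same raw test points end up at different positions in feature space, so a model trained on the train-scaled features sees inputs it was not fitted for.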

from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_features=3, n_samples=50, n_informative=1, noise=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)

pipeLine = make_pipeline(MinMaxScaler(),PolynomialFeatures(), LinearRegression())

pipeLine.fit(X_train,Y_train)
print(pipeLine.score(X_test,Y_test))
print(pipeLine.steps[2][1].intercept_)
print(pipeLine.steps[2][1].coef_)

# Fit the scaler on the training data only
scaler = MinMaxScaler().fit(X_train)
X_trainScaled = scaler.transform(X_train)
X_trainScaledandPoly = PolynomialFeatures().fit_transform(X_trainScaled)

# Reuse the fitted scaler on the test data; do not refit it
X_testScaled = scaler.transform(X_test)
X_testScaledandPoly = PolynomialFeatures().fit_transform(X_testScaled)

reg = LinearRegression()
reg.fit(X_trainScaledandPoly,Y_train)
print(reg.score(X_testScaledandPoly,Y_test))
print(reg.intercept_)
print(reg.coef_)
print(reg.intercept_ == pipeLine.steps[2][1].intercept_)
print(reg.coef_ == pipeLine.steps[2][1].coef_)
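As a self-contained check, the sketch below repeats the comparison with fixed seeds and uses np.isclose / np.allclose, which are more forgiving than == for floating-point values. The random_state values and the lower-case variable names are my own choices for reproducibility, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# random_state=0 is an arbitrary choice so the run is reproducible.
X, y = make_regression(n_features=3, n_samples=50, n_informative=1,
                       noise=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline route: fit() fits every step on the training data,
# score() only transforms the test data with the fitted steps.
pipe = make_pipeline(MinMaxScaler(), PolynomialFeatures(), LinearRegression())
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

# Manual route: fit each transformer once on the training data and reuse it.
scaler = MinMaxScaler().fit(X_train)
poly = PolynomialFeatures().fit(scaler.transform(X_train))
reg = LinearRegression().fit(poly.transform(scaler.transform(X_train)), y_train)
manual_score = reg.score(poly.transform(scaler.transform(X_test)), y_test)

# make_pipeline names each step after its lower-cased class name.
pipe_reg = pipe.named_steps["linearregression"]
scores_match = bool(np.isclose(manual_score, pipe_score))
coefs_match = bool(np.allclose(reg.coef_, pipe_reg.coef_))
print(scores_match, coefs_match)  # True True
```

With the fitted transformers reused on the test set, the manual score agrees with the pipeline score.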
