Reputation: 1041
I am using StandardScaler to normalize my dataset, i.e. I turn each feature into a z-score by subtracting the mean and dividing by the standard deviation.
I would like to use StandardScaler within sklearn's Pipeline, and I am wondering how exactly the transformation is applied to X_test. That is, in the code below, when I run pipeline.predict(X_test), it is my understanding that StandardScaler and SVC() are run on X_test, but what exactly does StandardScaler use as the mean and the standard deviation? The ones from X_train, or does it compute them only from X_test? What if, for instance, X_test consists of only 2 samples? The normalization would look a lot different than if I had normalized X_train and X_test together, right?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

steps = [('scaler', StandardScaler()),
         ('model', SVC())]
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Upvotes: 3
Views: 3713
Reputation: 1668
Sklearn's Pipeline applies transformer.fit_transform() when pipeline.fit() is called, and transformer.transform() when pipeline.predict() is called. So in your case, StandardScaler will be fitted to X_train, and then the mean and standard deviation from X_train will be used to scale X_test.
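You can check this directly on a standalone StandardScaler, since that is exactly what the pipeline does internally. A minimal sketch with made-up toy data: the scaler's statistics come from X_train only, and transform() reuses them on X_test.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, just for illustration).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # mean 2.5
X_test = np.array([[10.0], [20.0]])

# Fit on X_train only -- this is what happens inside pipeline.fit().
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # statistics learned from X_train, not X_test
print(scaler.scale_)

# transform() reuses X_train's mean/std -- this is what happens
# inside pipeline.predict(X_test).
print(scaler.transform(X_test))
```

Note the test values are scaled far outside [-1, 1], because they are judged against X_train's distribution rather than their own.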
The transform fitted on X_train alone would indeed look different from one fitted on X_train and X_test combined. The extent of the difference depends on how much the distributions of X_train and X_test differ. However, if they were randomly partitioned from the same original dataset, and are of a reasonable size, their distributions will probably be similar.
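Your 2-sample worry illustrates this nicely. A short sketch (again with hypothetical values): fitting a StandardScaler on just two test samples maps any two distinct values to -1 and +1, discarding their actual location and spread entirely.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_test = np.array([[10.0], [20.0]])

# Fitting on the two test samples alone: mean 15, std 5, so the
# samples always become -1 and +1, whatever their original values.
print(StandardScaler().fit_transform(X_test))  # [[-1.], [ 1.]]
```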
Regardless, it is important to treat X_test as though it were out of sample, so that it gives a (hopefully) reliable estimate of performance on unseen data. Since you don't know the distribution of unseen data, you should pretend you don't know the distribution of X_test either, including its mean and standard deviation.
Upvotes: 4