How to Scale, Train, and Fit data in classifier Pipeline correctly

Question

I'm trying to scale my data and train a classifier. My current data frame looks like so:

col1 col2 col3 category
---- ---- ---- --------
....

I'm confused about how the StandardScaler in my classifier pipeline affects my data. Here are my main questions:

Will the Scaler also scale Y_train? Does this actually matter in the context of machine learning?
Will the Scaler automatically scale X_test during prediction? If not, how do I do that using the previously calculated metrics?
Am I missing something fundamental in terms of scaling and splitting?

The docs are a little ambiguous, so I was hoping someone could clear this up. Thank you so much!

Currently, my pipeline looks like this:

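from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# c (the SVC regularization strength) and data (the data frame) are defined elsewhere
classifier = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC(kernel="linear", C=c))])

features = data.loc[:, data.columns != "category"]
categories = data["category"]

# The target passed here must be `categories`, the variable defined above
X_train, X_test, Y_train, Y_test = train_test_split(features, categories, train_size=0.7)

classifier.fit(X_train, Y_train)

classifier.predict(X_test)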

dx2-66 · Accepted Answer

The scaler in the pipeline won't scale the target, and for classification there is hardly any reason to do so. (Transforming the target is sometimes useful for regression, though, and that is handled by the TransformedTargetRegressor wrapper.)
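For the regression case mentioned in the parenthetical, here is a minimal sketch of how the wrapper could be used. The synthetic data, the choice of LinearRegression, and the variable names are assumptions for illustration only, not part of the question:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption, not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1000.0 * X[:, 0] + 50.0 * rng.normal(size=100)

# The inner pipeline scales the features only; the wrapper scales y separately.
inner = Pipeline(steps=[("scaler", StandardScaler()), ("reg", LinearRegression())])
model = TransformedTargetRegressor(regressor=inner, transformer=StandardScaler())
model.fit(X, y)

# Predictions come back on the original scale of y, because the wrapper
# applies the inverse of the target transform after predicting.
predictions = model.predict(X)
```

None of this is needed for the SVC classifier in the question, where the labels are passed through untouched.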

Once you .fit() your pipeline on the training data, .predict() applies the same transformations to the test data, using the statistics (mean and standard deviation) learned from the training set. That's basically the whole point of a pipeline.
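One way to convince yourself of this is to compare the pipeline's predictions with scaling and predicting by hand. The sketch below uses synthetic data from make_classification and C=1.0, both of which are stand-ins for the asker's data and value of c:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the asker's three feature columns and category column.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

classifier = Pipeline(steps=[("scaler", StandardScaler()),
                             ("svc", SVC(kernel="linear", C=1.0))])
classifier.fit(X_train, y_train)

# Pull the fitted steps back out of the pipeline.
fitted_scaler = classifier.named_steps["scaler"]
fitted_svc = classifier.named_steps["svc"]

# predict() on the pipeline is equivalent to transforming X_test with the
# already-fitted scaler and then calling predict() on the fitted SVC.
pipeline_pred = classifier.predict(X_test)
manual_pred = fitted_svc.predict(fitted_scaler.transform(X_test))
same = np.array_equal(pipeline_pred, manual_pred)

# The scaler's mean was computed from the training split only.
mean_matches_train = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
```

Both same and mean_matches_train should be True: the test set is scaled with the training statistics and never refit.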

Your pipeline is a fine basic example. If you want to apply different transformations to different kinds of features, you can extend it with ColumnTransformer.
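As a rough sketch of that extension, suppose col1 and col2 are numeric and col3 is categorical. That split, the toy values, and C=1.0 are assumptions made for illustration; the question does not say what the columns contain:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data frame with the same column names as in the question.
data = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "col2": [10.0, 20.0, 15.0, 30.0, 25.0, 35.0],
    "col3": ["a", "b", "a", "b", "a", "b"],
    "category": [0, 1, 0, 1, 0, 1],
})
features = data.drop(columns="category")
categories = data["category"]

# Scale the numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["col1", "col2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["col3"]),
])

classifier = Pipeline(steps=[("preprocess", preprocess),
                             ("svc", SVC(kernel="linear", C=1.0))])
classifier.fit(features, categories)
predictions = classifier.predict(features)
```

The ColumnTransformer sits inside the pipeline as a single step, so fit and predict are called exactly as before.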
