Reputation: 1076
I'm trying to scale my data and train a classifier. My current data frame looks like so:
col1 col2 col3 category
---- ---- ---- --------
....
I'm confused about how StandardScaler in my classifier pipeline is affecting my data. Here are my main questions:
The docs are a little ambiguous, so I was hoping someone could clear this up. Thank you so much!
Currently, my pipeline looks like this:
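from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# c (the SVC regularization strength) and data are defined elsewhere
classifier = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC(kernel="linear", C=c))])
features = data.loc[:, data.columns != "category"]
categories = data["category"]
X_train, X_test, Y_train, Y_test = train_test_split(features, categories, train_size=0.7)
classifier.fit(X_train, Y_train)
classifier.predict(X_test)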
Upvotes: 0
Views: 625
Reputation: 259
As far as I know, when you call pipeline.fit(), it also fits the scaler. For StandardScaler, that means it computes the mean and standard deviation from the training data. When you then call pipeline.predict() on the test data, it applies those stored mean and standard deviation values to the test data rather than recomputing them.
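Here is a minimal sketch you can run to check this yourself. The toy data, the random seed, and the variable names are made up for illustration; only the Pipeline, StandardScaler, and SVC usage mirrors the question. It compares the scaler's fitted statistics with the training data, then scales the test data by hand and confirms the predictions match what the pipeline produces.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data (assumption, not the asker's data): 3 numeric features, binary label.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(70, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))
y_train = (X_train[:, 0] > 5.0).astype(int)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC(kernel="linear"))])
pipe.fit(X_train, y_train)

# The fitted scaler holds statistics computed from X_train only.
scaler = pipe.named_steps["scaler"]
train_mean_matches = np.allclose(scaler.mean_, X_train.mean(axis=0))
train_std_matches = np.allclose(scaler.scale_, X_train.std(axis=0))

# predict() reuses those statistics: scaling the test set by hand with the
# training mean/std and calling the SVC directly gives the same predictions.
X_test_scaled = (X_test - scaler.mean_) / scaler.scale_
manual_pred = pipe.named_steps["svc"].predict(X_test_scaled)
pipeline_pred = pipe.predict(X_test)
same_predictions = np.array_equal(manual_pred, pipeline_pred)

print(train_mean_matches, train_std_matches, same_predictions)
```

The point of the check is that nothing about the test set leaks into the scaler: the test rows are only ever transformed, never used for fitting.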
Upvotes: 0
Reputation: 2851
The scaler in the pipeline won't scale the target, and when doing classification there's hardly any reason to do so. (Sometimes it is useful to transform the target when doing regression though, and this is handled by the TransformedTargetRegressor() wrapper.)
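For the regression case, a rough sketch of what that wrapper looks like is below. The synthetic data and the choice of LinearRegression are assumptions for illustration and have nothing to do with the question's classifier. The features are scaled by the inner pipeline, the target is scaled by the wrapper's own transformer, and predictions come back on the original target scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption): the target sits on a large scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1000.0 + 50.0 * X[:, 0] + rng.normal(size=100)

# The inner pipeline scales the features; the wrapper scales the target.
features_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("reg", LinearRegression())])
model = TransformedTargetRegressor(regressor=features_pipe, transformer=StandardScaler())
model.fit(X, y)

# predict() inverse-transforms, so pred is on the same scale as y.
pred = model.predict(X)
print(pred[:3], y[:3])
```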
Once you .fit() your pipeline with the train data, it will apply the fitted transformations to the test data when doing .predict(). That's basically the whole point of it.
This is an alright basic example. If you wish to apply different transformations to different kinds of features, you may expand it further using ColumnTransformer().
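A sketch of that extension could look like the following. It assumes, purely for illustration, that col1 and col2 are numeric and col3 is categorical; the question doesn't say what the columns contain, and the tiny data frame here is made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical frame: col1/col2 numeric, col3 categorical (assumption).
data = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "col2": [10.0, 20.0, 15.0, 30.0, 25.0, 40.0],
    "col3": ["a", "b", "a", "b", "a", "b"],
    "category": [0, 0, 0, 1, 1, 1],
})
features = data.drop(columns="category")
categories = data["category"]

# Scale the numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["col1", "col2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["col3"]),
])
classifier = Pipeline(steps=[("preprocess", preprocess), ("svc", SVC(kernel="linear"))])
classifier.fit(features, categories)
predictions = classifier.predict(features)

# Two scaled numeric columns plus two one-hot columns.
transformed_shape = classifier.named_steps["preprocess"].transform(features).shape
print(predictions, transformed_shape)
```

As with the plain scaler, everything inside the ColumnTransformer is fitted on whatever you pass to .fit() and only applied at .predict() time.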
Upvotes: 3