Tom

Reputation: 1063

Using normalized X, y after passing through pipeline

I am using a scikit-learn Pipeline to create an SVM. After creating the model, I'd like to use scikit-learn functions like confusion_matrix, plot_confusion_matrix and permutation_importance. In the past, when I've used these functions, I've always passed the normalized X values, but with the pipeline the values are normalized when the model is called. Is the pipeline also normalizing X inside plot_confusion_matrix and the like?

Here's the code that includes my Pipeline although there's nothing special about it. num_cols are the columns in a dataframe that have numeric features and cat_cols are categorical features.

num_trans = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())])  
cat_trans = OneHotEncoder(handle_unknown='ignore')

#num_cols are numeric features and cat_cols are categorical
preprocess = ColumnTransformer(transformers=[
                ('num', num_trans, num_cols),
                ('cat', cat_trans, cat_cols)])          

X = mdf[feats]
y = mdf[tgt].values

model = Pipeline(steps=[('preprocessor', preprocess), 
                        ('classifier', SVC(C=1, gamma=0.1))])
model.fit(X, y)
pred = model.predict(X)[-1]
cm = confusion_matrix(y, model.predict(X))
plot_confusion_matrix(model, X, y)

Upvotes: 0

Views: 381

Answers (2)

s.dallapalma

Reputation: 1315

@Andrea is right.

I'd add that if you want to pass the normalized/scaled data to your plot function yourself, you have to transform it with the preprocessing step that was fitted inside your Pipeline. Something like the following:

#...
model = Pipeline(steps=[('preprocessor', preprocess), 
                        ('classifier', SVC(C=1, gamma=0.1))])
model.fit(X, y)


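# model is a plain Pipeline (not a search object), so look the step up by name
preprocessor = model.named_steps['preprocessor']
# Apply the already-fitted imputer/scaler/encoder to X (no refitting)
X_preprocessed = preprocessor.transform(X)

# X_preprocessed is already transformed, so pass the bare classifier, not the pipeline
plot_confusion_matrix(model.named_steps['classifier'], X_preprocessed, y)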
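
To check that the pipeline applies the same fitted preprocessing on its own, a minimal sketch like the following compares both routes. The toy dataframe and its column names (age, income, city) are made up for illustration and stand in for mdf[feats]; the rest mirrors the pipeline from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data standing in for mdf[feats] and mdf[tgt]
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.normal(40, 10, 60),
    "income": rng.normal(50_000, 8_000, 60),
    "city": rng.choice(["a", "b", "c"], 60),
})
y = (X["age"] > 40).astype(int).to_numpy()
num_cols, cat_cols = ["age", "income"], ["city"]

num_trans = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
preprocess = ColumnTransformer(transformers=[
    ("num", num_trans, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)])
model = Pipeline(steps=[("preprocessor", preprocess),
                        ("classifier", SVC(C=1, gamma=0.1))])
model.fit(X, y)

# Route 1: raw X through the whole pipeline (preprocessing happens inside)
pred_pipeline = model.predict(X)

# Route 2: transform by hand, then call the bare classifier
X_preprocessed = model.named_steps["preprocessor"].transform(X)
pred_manual = model.named_steps["classifier"].predict(X_preprocessed)

# Both routes should give identical predictions
same = np.array_equal(pred_pipeline, pred_manual)
print(same)
```

If both routes agree, passing the pipeline together with the raw X is enough, and the manual transform is only needed when you want to work with the classifier step directly.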

Upvotes: 0

Andrea Ierardi

Reputation: 429

If I understand the question correctly, you are asking whether the values in the confusion matrix are also normalized, right? When you fit the model variable, which is a Pipeline object, you are actually running the preprocessing steps and then fitting the SVC classifier.

What plot_confusion_matrix actually does is take your model, make the predictions, compare the true values of the target (stored in y) with the predicted values, and generate the confusion matrix plot. When invoked, plot_confusion_matrix does not call the fit method of the Pipeline class; it only obtains predictions through the model.predict() method, and confusion_matrix simply compares the y and the predictions you hand it. Because model is a Pipeline, predict() first applies the already-fitted preprocessing steps to X and then calls the classifier, so you pass the raw X and nothing is refitted. For this reason, the results in the confusion matrix are not normalized: by default they are plain counts of predicted versus true labels (you can check the source code of plot_confusion_matrix in the scikit-learn library). The objective of the confusion matrix is to compare the predicted and true values of the target.
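
As a rough illustration of the point above, here is a small sketch. It assumes synthetic data from make_classification and a simplified scaler-plus-SVC pipeline rather than the asker's dataframe and ColumnTransformer. It shows that predictions come from raw X, that confusion_matrix returns counts unless you ask for normalization through its normalize parameter, and that permutation_importance can be given the pipeline and raw X in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the asker's dataframe)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = Pipeline(steps=[("scaler", StandardScaler()),
                        ("classifier", SVC(C=1, gamma=0.1))])
model.fit(X, y)

# predict() scales X with the already-fitted scaler, then calls SVC.predict
y_pred = model.predict(X)

cm_counts = confusion_matrix(y, y_pred)                   # raw counts
cm_rates = confusion_matrix(y, y_pred, normalize="true")  # each row sums to 1

# permutation_importance shuffles one raw column at a time and
# re-scores through the pipeline, so it also takes the raw X
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

print(cm_counts)
print(cm_rates.sum(axis=1))
print(result.importances_mean.shape)
```

Note that "normalized" means two different things here: scaling of the features happens inside the pipeline, while the normalize argument only rescales the counts in the matrix.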

NB: According to the documentation, plot_confusion_matrix is deprecated in 1.0 and will be removed in 1.2. I suggest using ConfusionMatrixDisplay instead.

Upvotes: 1
