Reputation: 1063
I am using a scikit-learn Pipeline to create an SVM. After creating the model, I'd like to use scikit-learn functions like confusion_matrix, plot_confusion_matrix and permutation_importance. In the past, when I've used these functions, I've always passed the normalized X values, but with the pipeline the values are normalized when the model is fit. Is the pipeline also normalizing them inside plot_confusion_matrix and the like?
Here's the code that includes my Pipeline, although there's nothing special about it. num_cols are the columns in a dataframe that have numeric features and cat_cols are categorical features.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

num_trans = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
cat_trans = OneHotEncoder(handle_unknown='ignore')
# num_cols are numeric features and cat_cols are categorical
preprocess = ColumnTransformer(transformers=[
    ('num', num_trans, num_cols),
    ('cat', cat_trans, cat_cols)])
X = mdf[feats]
y = mdf[tgt].values
model = Pipeline(steps=[('preprocessor', preprocess),
                        ('classifier', SVC(C=1, gamma=0.1))])
model.fit(X, y)
pred = model.predict(X)[-1]
cm = confusion_matrix(y, model.predict(X))
plot_confusion_matrix(model, X, y)
Upvotes: 0
Views: 381
Reputation: 1315
@Andrea is right.
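I'd add that if you want to pass the normalized/scaled data to your plot function yourself, you have to transform it using the parameters learned by your Pipeline, and then pass the classifier step rather than the whole pipeline, otherwise the data would be preprocessed a second time. Something like below:
#...
model = Pipeline(steps=[('preprocessor', preprocess),
                        ('classifier', SVC(C=1, gamma=0.1))])
model.fit(X, y)
# A Pipeline has no best_estimator_ (that belongs to search objects such as GridSearchCV);
# fitted steps are looked up by name through named_steps
preprocessor = model.named_steps['preprocessor']
classifier = model.named_steps['classifier']
X_preprocessed = preprocessor.transform(X)
# Pass the classifier step, not the full pipeline, so X is not preprocessed twice
plot_confusion_matrix(classifier, X_preprocessed, y)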
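To check that both routes agree, here is a minimal sketch that compares the confusion matrix from the full pipeline on raw X with the one from the classifier step on manually transformed X. The toy dataframe and its column names (age, income, city) are made up for illustration and stand in for mdf[feats] and mdf[tgt].

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data; the real question uses mdf[feats] and mdf[tgt].
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.normal(40, 10, 60),
    "income": rng.normal(50_000, 15_000, 60),
    "city": rng.choice(["a", "b", "c"], 60),
})
y = (X["age"] > 40).astype(int).to_numpy()
num_cols, cat_cols = ["age", "income"], ["city"]

num_trans = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
preprocess = ColumnTransformer(transformers=[
    ("num", num_trans, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)])
model = Pipeline(steps=[("preprocessor", preprocess),
                        ("classifier", SVC(C=1, gamma=0.1))])
model.fit(X, y)

# Route 1: raw X goes into the pipeline, which preprocesses it inside predict().
cm_pipeline = confusion_matrix(y, model.predict(X))

# Route 2: preprocess by hand, then call only the fitted classifier step.
X_preprocessed = model.named_steps["preprocessor"].transform(X)
cm_manual = confusion_matrix(
    y, model.named_steps["classifier"].predict(X_preprocessed))

same = np.array_equal(cm_pipeline, cm_manual)
print(same)
```

If the two matrices match, passing raw X together with the pipeline is enough, and the manual transform is only needed when you want to work with the classifier step on its own.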
Upvotes: 0
Reputation: 429
If I understand the question correctly, you are asking whether the values in the confusion matrix are also normalized, right? When you fit the model variable, which is a Pipeline object, you are actually running the preprocessing steps and then fitting the SVC classifier.
What plot_confusion_matrix actually does is take your model, make the predictions, compare the true values of the target (stored in y) with the predicted values, and generate the confusion matrix plot. Neither confusion_matrix nor plot_confusion_matrix calls the fit method of the Pipeline class; the predictions come from model.predict(), which applies the already-fitted preprocessing steps to X before the classifier. For this reason, the results in the confusion matrix are not normalized, since the functions do not call any fit method in the underlying sklearn library (you can check the source code of plot_confusion_matrix in the sklearn library). The objective of the confusion matrix is to compare the predicted and true values of the target.
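As a rough illustration of the point that predicting does not refit anything, the sketch below records the scaler's learned mean, predicts on new data with a very different distribution, and checks that the mean is unchanged. The data is synthetic and the step names (scaler, classifier) are chosen for this example only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_new = rng.normal(loc=5.0, size=(20, 3))  # deliberately shifted
y_new = (X_new[:, 0] > 0).astype(int)

model = Pipeline(steps=[("scaler", StandardScaler()),
                        ("classifier", SVC(C=1, gamma=0.1))])
model.fit(X_train, y_train)

# Parameters learned by the scaler during fit().
mean_before = model.named_steps["scaler"].mean_.copy()

# predict() only calls transform() on the scaler, never fit().
cm = confusion_matrix(y_new, model.predict(X_new), labels=[0, 1])

mean_after = model.named_steps["scaler"].mean_
unchanged = np.allclose(mean_before, mean_after)
print(unchanged)
print(cm)  # raw counts, one per (true label, predicted label) pair
```

The matrix entries are plain integer counts that add up to the number of samples passed in; scaling of the features only affects which predictions the classifier makes.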
NB. Looking at the documentation, plot_confusion_matrix is deprecated as of version 1.0 and will be removed in a later release. I suggest using ConfusionMatrixDisplay instead.
Upvotes: 1