Does cross_val_score not fit the actual input model?

Question

I am working on a project in which I am dealing with a large dataset.

I need to train an SVM classifier using K-fold cross-validation from scikit-learn.

import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score


x__df_chunk_synth = pd.read_csv('C:/Users/anujp/Desktop/sort/semester 4/ATML/Sem project/atml_proj/Data/x_train_syn.csv')
y_df_chunk_synth = pd.read_csv('C:/Users/anujp/Desktop/sort/semester 4/ATML/Sem project/atml_proj/Data/y_train_syn.csv')

svm_clf = svm.SVC(kernel='poly', gamma=1, class_weight=None, max_iter=20000, C = 100, tol=1e-5)
X = x__df_chunk_synth
Y = y_df_chunk_synth
scores = cross_val_score(svm_clf, X, Y,cv = 5, scoring = 'f1_weighted')
print(scores)
    
pred = svm_clf.predict(chunk_test_x)
accuracy = accuracy_score(chunk_test_y,pred)

print(accuracy)

I am using the code above. I understand that the classifier is trained inside cross_val_score, so whenever I call the classifier afterwards to predict on the test data, I get this error:

sklearn.exceptions.NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Is there a correct way to do this?

Please help me with this issue.

yatu · Accepted Answer

Indeed, model_selection.cross_val_score fits the model you pass in, so the model does not have to be fitted beforehand. However, it does not fit the actual object used as input but a copy of it, hence the error This SVC instance is not fitted yet... when trying to predict.
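
A minimal sketch of this behaviour, assuming synthetic data from make_classification in place of the CSV files from the question (the data and the reduced set of SVC parameters are illustrative assumptions, not the asker's setup):

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the question's training data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

svm_clf = SVC(kernel="poly", C=100, max_iter=20000)

# Each of the 5 folds fits its own copy of svm_clf.
scores = cross_val_score(svm_clf, X, y, cv=5, scoring="f1_weighted")

# The object passed in was never fitted, so predict raises NotFittedError.
try:
    svm_clf.predict(X)
    still_unfitted = False
except NotFittedError:
    still_unfitted = True

print(len(scores), still_unfitted)
```

The scores come from the per-fold copies; svm_clf itself is left exactly as it was constructed.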

Looking at the source code of cross_validate, which cross_val_score calls, the estimator is passed through clone before it is fitted and scored:

scores = parallel(
    delayed(_fit_and_score)(
        clone(estimator), X, y, scorers, train, test, verbose, None,
        fit_params, return_train_score=return_train_score,
        return_times=True, return_estimator=return_estimator,
        error_score=error_score)
    for train, test in cv.split(X, y, groups))

clone builds a new, unfitted estimator with the same parameters, which is why the actual input model is never fitted:

def clone(estimator, *, safe=True):
    """Constructs a new estimator with the same parameters.
    Clone does a deep copy of the model in an estimator
    without actually copying attached data. It yields a new estimator
    with the same parameters that has not been fit on any data.
    ...
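
One way to get a usable model for prediction, sketched under the same assumption of synthetic data (the train/test split and variable names such as X_test and fold_models are illustrative, not from the question): either call fit on the estimator yourself after cross-validating, or ask cross_validate to hand back the per-fold fitted copies via return_estimator=True.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the question's data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

svm_clf = SVC(kernel="poly", C=100, max_iter=20000)

# clone() carries over the hyperparameters but no fitted state.
cloned = clone(svm_clf)
same_params = cloned.get_params() == svm_clf.get_params()
cloned.fit(X_train, y_train)
# Fitting the clone leaves the original untouched.
original_fitted = hasattr(svm_clf, "support_")

# Option 1: fit the estimator explicitly, then predict or score.
svm_clf.fit(X_train, y_train)
accuracy = svm_clf.score(X_test, y_test)

# Option 2: keep the fitted copy from each fold.
cv_results = cross_validate(
    SVC(kernel="poly", C=100, max_iter=20000),
    X_train,
    y_train,
    cv=5,
    scoring="f1_weighted",
    return_estimator=True,
)
fold_models = cv_results["estimator"]
fold_pred = fold_models[0].predict(X_test)

print(same_params, original_fitted, accuracy, len(fold_models))
```

Option 1 trains on the whole training set, which is usually what you want for the final model; the estimators from option 2 were each trained on only four of the five folds.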
