learner
learner

Reputation: 541

How to correctly perform cross validation in scikit-learn?

I am trying to do a cross validation on a k-nn classifier and I am confused about which of the following two methods below conducts cross validation correctly.

training_scores = defaultdict(list)
validation_f1_scores = defaultdict(list)
validation_precision_scores = defaultdict(list)
validation_recall_scores = defaultdict(list)
validation_scores = defaultdict(list)

def model_1(seed, X, Y):
    np.random.seed(seed)
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)

    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    scores = model_selection.cross_validate(model, X, Y, cv=kfold, scoring=scoring, return_train_score=True)
    print(scores['train_accuracy'])
    training_scores['KNeighbour'].append(scores['train_accuracy'])
    print(scores['test_f1_macro'])
    validation_f1_scores['KNeighbour'].append(scores['test_f1_macro'])
    print(scores['test_precision_macro'])
    validation_precision_scores['KNeighbour'].append(scores['test_precision_macro'])
    print(scores['test_recall_macro'])
    validation_recall_scores['KNeighbour'].append(scores['test_recall_macro'])
    print(scores['test_accuracy'])
    validation_scores['KNeighbour'].append(scores['test_accuracy'])

    print(np.mean(training_scores['KNeighbour']))
    print(np.std(training_scores['KNeighbour']))
    #rest of print statments

It seems that for loop in the second model is redundant.

def model_2(seed, X, Y):
    np.random.seed(seed)
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)

    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    for train, test in kfold.split(X, Y):
        scores = model_selection.cross_validate(model, X[train], Y[train], cv=kfold, scoring=scoring, return_train_score=True)
        print(scores['train_accuracy'])
        training_scores['KNeighbour'].append(scores['train_accuracy'])
        print(scores['test_f1_macro'])
        validation_f1_scores['KNeighbour'].append(scores['test_f1_macro'])
        print(scores['test_precision_macro'])
        validation_precision_scores['KNeighbour'].append(scores['test_precision_macro'])
        print(scores['test_recall_macro'])
        validation_recall_scores['KNeighbour'].append(scores['test_recall_macro'])
        print(scores['test_accuracy'])
        validation_scores['KNeighbour'].append(scores['test_accuracy'])

    print(np.mean(training_scores['KNeighbour']))
    print(np.std(training_scores['KNeighbour']))
    # rest of print statments

I am using StratifiedKFold and I am not sure if I need for loop as in model_2 function or does cross_validate function already use the split as we are passing cv=kfold as an argument.

I am not calling fit method, is this OK? Does cross_validate calls that automatically or do I need to call fit before calling cross_validate?

Finally, how can I create confusion matrix? Do I need to create it for each fold, if yes, how can the final/average confusion matrix be calculated?

Upvotes: 4

Views: 7417

Answers (2)

mujjiga
mujjiga

Reputation: 16856

model_1 is correct.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=’warn’, n_jobs=None, verbose=0, fit_params=None, pre_dispatch=‘2*n_jobs’, return_train_score=’warn’, return_estimator=False, error_score=’raise-deprecating’)

where

estimator is an object implementing ‘fit’. It will be called to fit the model on the train folds.

cv: is a cross-validation generator that is used to generated train and test splits.

If you follow the example in the sklearn docs

cv_results = cross_validate(lasso, X, y, cv=3, return_train_score=False) cv_results['test_score'] array([0.33150734, 0.08022311, 0.03531764])

You can see that the model lasso is fitted 3 times once for each fold on train splits and also validated 3 times on test splits. You can see that the test score on validation data are reported.

Cross validation of Keras models

Keras provides wrapper which makes the keras models compatible with sklearn cross_validatation method. You have to wrap the keras model using KerasClassifier

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold, cross_validate
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

def get_model():
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu')) 
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=get_model, epochs=10, batch_size=8, verbose=0)
kf = KFold(n_splits=3, shuffle=True)

X = np.random.rand(10,2)
y = np.random.rand(10,1)

cv_results = cross_validate(model, X, y, cv=kf, return_train_score=False)

print (cv_results)

Upvotes: 4

desertnaut
desertnaut

Reputation: 60321

The documentation is arguably your best friend in such questions; from the simple example there it should be apparent that you should use neither a for loop nor a call to fit. Adapting the example to use KFold as you do:

from sklearn.model_selection import KFold, cross_validate
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)

model = DecisionTreeRegressor()
scoring=('r2', 'neg_mean_squared_error')

cv_results = cross_validate(model, X, y, cv=kf, scoring=scoring, return_train_score=False)
cv_results

Result:

{'fit_time': array([0.00901461, 0.00563478, 0.00539804, 0.00529385, 0.00638533]),
 'score_time': array([0.00132656, 0.00214362, 0.00134897, 0.00134444, 0.00176597]),
 'test_neg_mean_squared_error': array([-11.15872549, -30.1549505 , -25.51841584, -16.39346535,
        -15.63425743]),
 'test_r2': array([0.7765484 , 0.68106786, 0.73327311, 0.83008371, 0.79572363])}

how can I create confusion matrix? Do I need to create it for each fold

No one can tell you if you need to create a confusion matrix for each fold - it is your choice. If you choose to do so, it may be better to skip cross_validate and do the procedure "manually" - see my answer in How to display confusion matrix and report (recall, precision, fmeasure) for each cross validation fold.

if yes, how can the final/average confusion matrix be calculated?

There is no "final/average" confusion matrix; if you want to calculate anything further than the k ones (one for each k-fold) as described in the linked answer, you need to have available a separate validation set...

Upvotes: 7

Related Questions