Boom

Reputation: 1315

What is the difference between getting the score value from KFold + fit + score vs using cross_val_score?

It seems basic, but I can't see the difference, or the advantages and disadvantages, between the following two ways:

first way:

    from sklearn.model_selection import KFold

    kf = KFold(n_splits=2)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        clf.score(X_test, y_test)

second way:

    cross_val_score(clf, X, y, cv=2)

It seems that the two ways do the same thing, and the second one is shorter (one line).

What am I missing?

What are the differences and advantages or disadvantages of each way?

Upvotes: 0

Views: 709

Answers (1)

desertnaut

Reputation: 60321

Arguably, the best way to see such differences is to experiment, although here the situation is rather easy to discern:

In your first snippet, clf.score is called inside the loop but its return value is never stored anywhere; after the loop finishes, all you have access to is the score of the last validation fold, and everything computed in the previous k-1 folds is forgotten.

cross_val_score, on the other hand, returns the scores from all k folds. It is generally preferable, but it has no shuffle option of its own (and shuffling is almost always advisable), so you either need to shuffle the data manually first, as shown here, or pass a shuffling splitter explicitly with cv=KFold(n_splits=k, shuffle=True).
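As a quick illustration (using the iris data and a logistic regression purely as stand-ins for your own X, y, and clf), passing a shuffling KFold splitter to cross_val_score yields one score per fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# shuffle inside the splitter instead of shuffling the data manually;
# n_jobs=-1 additionally runs the folds in parallel on all available cores
scores = cross_val_score(clf, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         n_jobs=-1)
print(scores)         # one accuracy per fold - all 5, not just the last
print(scores.mean())  # their average
```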

A disadvantage of the for loop + KFold method is that it runs serially, while the CV procedure in cross_val_score can be parallelized across multiple cores with the n_jobs argument.

A limitation of cross_val_score is that it cannot be used with multiple metrics; but even in that case you can use cross_validate, as shown in this thread, so falling back to for + kfold is still not necessary.
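A minimal sketch of the multi-metric case (again with iris and a logistic regression as placeholders for your data and estimator): cross_validate accepts a list of scorers and returns a dict with one test_&lt;metric&gt; array per scorer, plus fit/score timings:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# two metrics in a single CV run
res = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "f1_macro"])
print(res["test_accuracy"])  # one accuracy per fold
print(res["test_f1_macro"])  # one macro-averaged F1 per fold
```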

The use of KFold in a for loop does give additional flexibility for cases where neither cross_val_score nor cross_validate is adequate, for example using the scikit-learn wrapper for Keras while still getting all the metrics returned by native Keras during training, as shown here; or if you want to permanently store the different folds in separate variables/files, as shown here.

In short:

  • if you just want the scores for a single metric, stick to cross_val_score (shuffle first, and parallelize if needed).
  • if you want multiple metrics, use cross_validate (again, shuffle first and parallelize).
  • if you need greater control over, or monitoring of, the whole CV process, fall back to using KFold in a for loop accordingly.
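For completeness, here is what the manual loop looks like once the per-fold scores are actually collected (iris and a logistic regression as stand-ins; note shuffle=True on the splitter, and plain array indexing since these are NumPy arrays rather than the pandas objects in the question):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    # with pandas objects, index with .iloc as in the question
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))  # store each fold's score

print(scores)  # all 5 fold scores, not just the last one
```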

Upvotes: 3
