Reputation: 1315
It seems basic, but I can't see the difference, or the advantages and disadvantages, between the following 2 ways:
first way:
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
second way:
cross_val_score(clf, X, y, cv=2)
It seems that the 2 ways do the same thing, and the second one is shorter (one line).
What am I missing?
What are the differences and the advantages or disadvantages of each way?
Upvotes: 0
Views: 709
Reputation: 60321
Arguably, the best way to see such differences is to experiment, although here the situation is rather easy to discern:

clf.score is in a loop; hence, after the loop has finished, it contains only the score from the last validation fold, forgetting everything that was done in the previous k-1 folds.

cross_val_score, on the other hand, returns the scores from all k folds. It is generally preferable, but it lacks a shuffle option (and shuffling is almost always advisable), so you either need to shuffle the data manually first, as shown here, or use it with cv=KFold(n_splits=k, shuffle=True).
A disadvantage of the for loop + kfold method is that it runs serially, while the CV procedure in cross_val_score can be parallelized across multiple cores with the n_jobs argument.
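To make the two points above concrete, here is a minimal sketch of cross_val_score with a shuffled KFold splitter and parallel execution (the dataset and classifier are just illustrative stand-ins, not from the question):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Shuffle inside the CV splitter (cross_val_score itself has no shuffle
# option) and run the folds in parallel on all cores with n_jobs=-1:
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)

print(scores)        # one score per fold, not just the last one
print(scores.mean())
```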
A limitation of cross_val_score is that it cannot be used with multiple metrics; but even in that case you can use cross_validate, as shown in this thread - there is no need to resort to for + kfold.
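For the multiple-metrics case, a minimal sketch with cross_validate (again with an illustrative dataset and classifier, and two scoring names picked as examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Pass a list of scorer names; results come back keyed as "test_<metric>":
res = cross_validate(clf, X, y, cv=cv,
                     scoring=["accuracy", "f1_macro"])

print(res["test_accuracy"])
print(res["test_f1_macro"])
```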
The use of kfold in a for loop gives additional flexibility for cases where neither cross_val_score nor cross_validate is adequate - for example, using the scikit-learn wrapper for Keras while still getting all the metrics returned by native Keras during training, as shown here; or if you want to permanently store the different folds in separate variables/files, as shown here.
In short:
- If you want a single metric, use cross_val_score (shuffle first and parallelize).
- If you want multiple metrics, use cross_validate (again, shuffle first and parallelize).
- If you need more control or flexibility, use kfold in a for loop accordingly.

Upvotes: 3