Jesmar Scicluna

Reputation: 23

Implementation of Cross-validation

I am confused because different people apply cross-validation in different ways. For instance, some apply it to the whole dataset and some apply it only to the training set.

My question is whether the code below is an appropriate way to implement cross-validation and to make predictions from the model while cross-validation is being applied.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Specify the model
model = GradientBoostingClassifier(n_estimators=10, max_depth=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True)

# X - the whole dataset (features)
# y - the target attributes for the whole dataset

y_pred = cross_val_predict(model, X, y, cv=cv)
scores = cross_val_score(model, X, y, cv=cv)

Upvotes: -1

Views: 452

Answers (1)

B200011011

Reputation: 4258

Even when using cross-validation, you need a held-out test set to evaluate performance on completely unseen data. Performance tuning should not be done on this test set, to avoid data leakage.

Split the data into two segments: train and test. There are various CV methods, such as K-Fold, Stratified K-Fold, etc. Visualizations and further reading material are here:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

In K-Fold CV, the training data is split into K folds. Then, for each iteration, the model is trained on K-1 of the folds and the remaining fold is used for performance evaluation.
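For instance, here is a minimal sketch of that workflow, reusing the X, y placeholders and the GradientBoostingClassifier settings from your question (test_size=0.2 is just an illustrative choice):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hold out a test set first; cross-validation happens on the training portion only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(n_estimators=10, max_depth=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# CV scores on the training data only (use these for model selection/tuning)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Fit on the full training set, then evaluate once on the untouched test set
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))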

Further detail about cross-validation, train/validation/test splits, etc. can be found here:

https://scikit-learn.org/stable/modules/cross_validation.html

[Figure: train/validation/test split and cross-validation workflow diagram from the scikit-learn documentation]

[Figure: visualization of K-Fold cross-validation for 3 classes]
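As a rough sketch of what that visualization shows, you can compare plain KFold with StratifiedKFold on a made-up imbalanced 3-class target (the toy y array below is purely illustrative); stratification keeps the class proportions roughly equal in every fold:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy 3-class target: 60% class 0, 30% class 1, 10% class 2
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.zeros((len(y), 1))  # dummy features; only the splitting matters here

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name)
    for train_idx, test_idx in cv.split(X, y):
        # Class counts in each validation fold
        print("  fold class counts:", np.bincount(y[test_idx], minlength=3))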

Upvotes: 1
