Reputation: 31
Can someone please let me know if this is the correct way to calculate the cross-validated precision of my classifier? I divided my dataset into xtrain and ytrain for the training data and xtest and ytest for the test set.
Building the model:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=100)
Fitting it to training set:
RFC.fit(xtrain, ytrain)
This is the part I am unsure about:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RFC, xtest, ytest, cv=10, scoring='precision')
Using the code above, would "scores" give me the precision of my model, which was trained on the training data? I am worried that I used the wrong code and that I am actually fitting the model to xtest, because the recall and precision scores for my test data are HIGHER than the scores for my training data, and I can't figure out why!
Upvotes: 3
Views: 9219
Reputation: 138
You don't actually have to fit the model yourself before computing the cross-validation score: cross_val_score clones the estimator and fits it inside each fold, so any earlier call to fit is ignored.
The correct (and simpler) way to get the cross-validated score is to just create the model as you do:
RFC = RandomForestClassifier(n_estimators=100)
Then just compute the scores on the training data:
scores = cross_val_score(RFC, xtrain, ytrain, cv = 10, scoring='precision')
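scores is then an array with one precision value per fold (10 here); a common way to summarize it is the mean and spread, for example:

print("Precision: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))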
Usually in machine learning / statistics, you split your data into a training set and a test set (as you did). After that, the training data is used to validate the model (tuning parameters, cross-validation, etc.), and the final model is then evaluated on the test set. Thus, you won't actually use your test set in the cross-validation, only in the final phase when you want the final accuracy of the model.
Separating the data into training and test sets and doing the cross-validation on the training data has the advantage that the model parameters won't be overfit (through cross-validation) to the separate test set, which is only used in the final phase.
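Putting the whole workflow together, a minimal sketch could look like the following (the make_classification data and the 75/25 split are just placeholders for illustration; substitute your own arrays and split):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data purely for illustration; replace with your own dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=0)

RFC = RandomForestClassifier(n_estimators=100)

# 1) Cross-validate on the training set only (model selection / sanity check)
cv_scores = cross_val_score(RFC, xtrain, ytrain, cv=10, scoring='precision')
print("CV precision (training data): %0.3f" % cv_scores.mean())

# 2) Fit the final model on the full training set
RFC.fit(xtrain, ytrain)

# 3) Evaluate exactly once on the held-out test set
print("Precision (test set): %0.3f" % precision_score(ytest, RFC.predict(xtest)))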
You can learn more here: cross_val_score and Cross-Validation
Upvotes: 4