I am trying to use K-Fold Cross Validation and GridSearchCV to optimise my Gradient Boosting model, following this guide: https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train/test split? If you change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note that my sklearn model with no CV and an 80/20 train/test split gives an accuracy of around 0.79-0.80.)
2) Again, how is the AUC Score (Train) calculated? And should this be labelled ROC-AUC rather than AUC? My sklearn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV Score so much lower than the AUC Score (Train)? It looks like they are both using roc_auc (my sklearn model gives 0.77 for the ROC AUC).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

df = pd.read_csv("123.csv")
target = 'APPROVED'  # item to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the full data set
    alg.fit(ddf[predictors], ddf[target])

    # Predict on the same data the model was fitted on (training set)
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]

    # Perform cross-validation (fits clones of alg; alg itself is not changed)
    if performCV:
        cv_score = cross_val_score(alg, ddf[predictors], ddf[target], cv=cv_folds, scoring='roc_auc')

    # Print model report (accuracy and AUC here are training-set scores)
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf[target].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf[target], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g"
              % (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))

    # Plot feature importance
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except target & ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
Reputation: 2980
The main reason your cv_score appears low is that comparing it to the training scores isn't a fair comparison. Your training accuracy and AUC are calculated on the same data that was used to fit the model, whereas cv_score is the average score over the held-out test folds of the cross validation. A model will naturally perform better on data it has already been trained on than on new data it has never seen before.
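To make the gap concrete, here is a minimal sketch on synthetic data (the data set from make_classification and all of its settings are my assumptions, not your 123.csv). It scores one model twice: once on the data it was fitted on, and once with 5-fold cross validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption); flip_y adds label noise
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=10)

model = GradientBoostingClassifier(random_state=10)
model.fit(X, y)

# Scored on the same rows the model was fitted on: optimistic
train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Scored on held-out folds: each fold is predicted by a clone that never saw it
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print("AUC (train):   %.4f" % train_auc)
print("AUC (CV mean): %.4f" % cv_auc.mean())
```

On noisy data like this the training AUC should come out well above the CV mean, which is the same pattern as in your Model Report.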
Your accuracy_score and roc_auc_score calculations appear fixed because you always pass the same inputs (ddf["APPROVED"], ddf_predictions and ddf_predprob) into them. The performCV section doesn't change any of those values: cross_val_score fits clones of the model on each fold and leaves alg and its training-set predictions untouched. So if you're using the same model, model parameters, and input data, you'll get the same predictions going into the calculations, whatever cv_folds is set to.
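A quick way to check this yourself is sketched below, again on synthetic data (the data set, the n_estimators=50 setting and the report helper are illustrative assumptions). It refits the same model for several fold counts and records the training-set scores next to the CV mean.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           flip_y=0.1, random_state=10)

def report(cv_folds):
    """Hypothetical helper: training-set scores plus mean CV AUC for one fold count."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=10)
    model.fit(X, y)
    train_acc = accuracy_score(y, model.predict(X))
    train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    cv_auc = cross_val_score(model, X, y, cv=cv_folds, scoring="roc_auc").mean()
    return train_acc, train_auc, cv_auc

results = {k: report(k) for k in (3, 5, 10)}
for k, (acc, auc, cv) in results.items():
    print("cv=%2d | train acc %.6f | train AUC %.6f | CV AUC %.4f" % (k, acc, auc, cv))
```

The two training columns should be identical on every row, because cv_folds never enters the fit or the training-set predictions; only the CV column depends on the number of folds.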
Based on your comments, there are a number of reasons the cv_score could be lower than the score on your full test set. One of the main reasons is that the model has more data to train on when you fit it on the full training set than when it is fitted on a subset of the training data in each cv fold. This matters most when your data set is small: each additional sample then carries more weight in training and can noticeably improve performance.
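If you want to see how much the amount of training data matters, a learning curve is one option. This is only a sketch under assumptions: the data is synthetic, and the training fractions (10%, 50%, 100% of each training fold) are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=10)

model = GradientBoostingClassifier(n_estimators=50, random_state=10)

# Fit on growing fractions of each training fold, always scoring on the held-out fold
sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=[0.1, 0.5, 1.0], cv=5,
    scoring="roc_auc", shuffle=True, random_state=10)

mean_test = test_scores.mean(axis=1)
for n, score in zip(sizes, mean_test):
    print("train size %4d | mean held-out AUC %.4f" % (n, score))
```

If the held-out score is still climbing at the largest training size, the model is data-limited, and a fit that sees more rows can be expected to score higher than the per-fold fits do.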