kilojoules

Reputation: 10093

understanding python xgboost cv

I would like to use the xgboost cv function to find the best parameters for my training data set. I am confused by the API. How do I find the best parameter? Is this similar to sklearn's grid_search cross-validation function? How can I find which of the options for the max_depth parameter ([2,4,6]) was determined to be optimal?

from sklearn.datasets import load_iris
import xgboost as xgb
iris = load_iris()
DTrain = xgb.DMatrix(iris.data, iris.target)
x_parameters = {"max_depth":[2,4,6]}
xgb.cv(x_parameters, DTrain)
...
Out[6]: 
   test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0        0.888435       0.059403         0.888052        0.022942
1        0.854170       0.053118         0.851958        0.017982
2        0.837200       0.046986         0.833532        0.015613
3        0.829001       0.041960         0.824270        0.014501
4        0.825132       0.038176         0.819654        0.013975
5        0.823357       0.035454         0.817363        0.013722
6        0.822580       0.033540         0.816229        0.013598
7        0.822265       0.032209         0.815667        0.013538
8        0.822158       0.031287         0.815390        0.013508
9        0.822140       0.030647         0.815252        0.013494

Upvotes: 23

Views: 52989

Answers (4)

Eran Moshe

Reputation: 3208

I would go with hyperOpt

https://github.com/hyperopt/hyperopt

It's open source and has worked great for me. If you do choose it and need help, I can elaborate.

When you want to search over "max_depth": [2,4,6], you can naively solve this by running three models, each with one of the max depths you want, and seeing which model yields the best results; a minimal sketch of that approach follows.
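For example (a minimal sketch reusing the DMatrix from the question; the number of boosting rounds and the default objective/metric are illustrative, not recommendations):

from sklearn.datasets import load_iris
import xgboost as xgb

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, iris.target)

# run one CV per candidate max_depth and keep the final test RMSE
results = {}
for max_depth in [2, 4, 6]:
    cv = xgb.cv({"max_depth": max_depth}, dtrain, num_boost_round=10)
    results[max_depth] = cv["test-rmse-mean"].iloc[-1]

best_depth = min(results, key=results.get)
print(results)
print("best max_depth:", best_depth)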

But "max_depth" is not the only hyper parameter you should consider tune. There are a lot of other hyper parameters, such as: eta (learning rate), gamma, min_child_weight, subsample and so on. Some are continues and some are discrete. (assuming you know your objective functions and evaluation metrics)

you can read about all of them here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

When you look on all those "parameters" and the size of dimension they create, its huge. You cannot search in it by hand (nor does an "expert" can give you the best arguments to them).

Therefor, hyperOpt gives you a neat solution to this, and builds you a search space which is not exactly random nor grid. All you need to do is define the parameters and their ranges.
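As a rough sketch of what that looks like (the ranges, nfold and max_evals here are illustrative, and the DMatrix is the one from the question):

from sklearn.datasets import load_iris
from hyperopt import fmin, tpe, hp, Trials
import xgboost as xgb

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, iris.target)

# mix of discrete (hp.choice, hp.quniform) and continuous (hp.uniform) parameters
space = {
    "max_depth": hp.choice("max_depth", [2, 4, 6]),
    "eta": hp.uniform("eta", 0.01, 0.3),
    "min_child_weight": hp.quniform("min_child_weight", 1, 10, 1),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    # loss to minimize: CV test RMSE after the last boosting round
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=3)
    return cv["test-rmse-mean"].iloc[-1]

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)  # note: hp.choice reports the index of the chosen value, not the value itself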

You can find a fuller code example here: https://github.com/bamine/Kaggle-stuff/blob/master/otto/hyperopt_xgboost.py

I can tell you from my own experience it worked better than Bayesian Optimization on my models. Give it a few hours/days of trial and error, and contact me if you encounter issues you cannot solve.

Good luck!

Upvotes: 8

Rohit

Reputation: 159

You can use GridSearchCV with xgboost through the xgboost sklearn API.

Define your classifier as follows:

from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older sklearn versions

xgb_model = XGBClassifier(**other_params)  # other_params: your fixed parameters

test_params = {
    'max_depth': [4, 8, 12]
}

model = GridSearchCV(estimator=xgb_model, param_grid=test_params)
model.fit(train, target)  # train, target: your feature matrix and labels
print(model.best_params_)

Upvotes: 15

Deepish

Reputation: 786

Sklearn's GridSearchCV should be the way to go if you are looking for parameter tuning. You just need to pass the xgb classifier to GridSearchCV and evaluate on the best CV score.
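A minimal sketch of that (the grid and cv value here are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(XGBClassifier(), param_grid={"max_depth": [2, 4, 6]}, cv=3)
search.fit(X, y)
print(search.best_params_)  # the winning parameter combination
print(search.best_score_)   # its mean cross-validated score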

Here is a nice tutorial that might help you get started with parameter tuning: http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Upvotes: 9

Aske Doerge

Reputation: 1391

Cross-validation is used for estimating the performance of one set of parameters on unseen data.

Grid-search evaluates a model with varying parameters to find the best possible combination of these.

The sklearn docs talk a lot about CV, and the two can be used in combination, but they each have very different purposes.

You might be able to fit xgboost into sklearn's grid search functionality. Check out the sklearn interface to xgboost for the smoothest application.
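To make the distinction concrete, here is a small sketch using xgboost's sklearn interface (the parameter values and cv folds are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

# cross-validation: estimates performance of ONE fixed parameter set
scores = cross_val_score(XGBClassifier(max_depth=4), X, y, cv=5)
print("CV accuracy for max_depth=4:", scores.mean())

# grid search: runs CV for EACH candidate and picks the best combination
grid = GridSearchCV(XGBClassifier(), {"max_depth": [2, 4, 6]}, cv=5)
grid.fit(X, y)
print("best combination:", grid.best_params_)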

Upvotes: 13
