Reputation: 111
I'll preface my question by saying that I am, currently, unable to share my data due to extremely strict confidentiality agreements surrounding it. Hopefully I'll be able to get permission to share the blinded data shortly.
I am struggling to get XGBoost to train properly in R. I have been following the guide here and am so far stuck on step 1, tuning the nrounds parameter. The results I'm getting from cross-validation aren't doing what I'd expect, which leaves me at a loss for how to proceed.
My data contains 105 observations, a continuous response variable (histogram in the top left pane of the image in the link below) and 16095 predictor variables. All of the predictors are on the same scale and a histogram of them all is in the top right pane of the image in the link below. The predictor variables are quite zero-heavy, with 62.82% of all values being 0.
As a separate set of test data I have a further 48 observations. Both data sets have a very similar range in their response variables.
So far I've been able to fit a PLS model and a random forest (using the R library ranger). Applying these two models to my test data set gives an RMSE of 19.133 from PLS and 15.312 from ranger. In the case of ranger, successive model fits are proving very stable using 2000 trees and 760 variables tried at each split (a rough sketch of that fit is below).
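Roughly, the ranger fit looks like this (a minimal sketch, assuming data.train and data.test as read in further down, with the response in the first column, as in my xgboost code):
library(ranger)
## Sketch of the ranger fit: 2000 trees, 760 candidate variables per split,
## response assumed to be the first column of the data frames.
rf.fit <- ranger(dependent.variable.name = names(data.train)[1],
                 data = data.train,
                 num.trees = 2000,
                 mtry = 760)
rf.pred <- predict(rf.fit, data = data.test)$predictions
sqrt(mean((rf.pred - data.test[, 1])^2))   # test RMSE on the 48 held-out observations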
Returning to XGBoost, using the code below, I have been fixing all parameters except nrounds and using the xgb.cv function in the R package xgboost to calculate the training and test errors.
library(xgboost)

data.train <- read.csv("../Data/Data_Train.csv")
data.test <- read.csv("../Data/Data_Test.csv")

dtrain <- xgb.DMatrix(data = as.matrix(data.train[, -c(1)]),
                      label = data.train[, 1])
# dtest <- xgb.DMatrix(data = as.matrix(data.test[,-c(1)]), label=data.test[,1]) # Not used here
## Step 1 - tune number of trees using CV function
eta = 0.1; gamma = 0; max_depth = 15;
min_child_weight = 1; subsample = 0.8; colsample_bytree = 0.8
nround=2000
cv <- xgb.cv(
  params = list(
    ## General Parameters
    booster = "gbtree",             # Default
    silent = 0,                     # Default
    ## Tree Booster Parameters
    eta = eta,
    gamma = gamma,
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    colsample_bytree = colsample_bytree,
    num_parallel_tree = 1,          # Default
    ## Linear Booster Parameters
    lambda = 1,                     # Default
    lambda_bias = 0,                # Default
    alpha = 0,                      # Default
    ## Task Parameters
    objective = "reg:linear",       # Default
    base_score = 0.5,               # Default
    # eval_metric = ,               # Evaluation metric, set based on objective
    nthread = 60
  ),
  data = dtrain,
  nrounds = nround,
  nfold = 5,
  stratified = TRUE,
  prediction = TRUE,
  showsd = TRUE,
  # early_stopping_rounds = 20,
  # maximize = FALSE,
  verbose = 1
)
library(ggplot2)
library(reshape2)

plot.df <- data.frame(NRound = as.matrix(cv$evaluation_log)[, 1],
                      Train  = as.matrix(cv$evaluation_log)[, 2],
                      Test   = as.matrix(cv$evaluation_log)[, 4])
plot.df <- melt(plot.df, measure.vars = 2:3)

ggplot(data = plot.df, aes(x = NRound, y = value, colour = variable)) +
  geom_line() +
  ylab("Mean RMSE")
If this function does what I believe it does, I was hoping to see the training error decrease to a plateau and the test error decrease and then begin to increase again as the model overfits. However, the output I'm getting looks like the output below (and also the lower figure in the link above).
##### xgb.cv 5-folds
iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
1 94.4494006 1.158343e+00 94.55660 4.811360
2 85.5397674 1.066793e+00 85.87072 4.993996
3 77.6640230 1.123486e+00 78.21395 4.966525
4 70.3846390 1.118935e+00 71.18708 4.759893
5 63.7045868 9.555162e-01 64.75839 4.668103
---
1996 0.0002458 8.158431e-06 18.63128 2.014352
1997 0.0002458 8.158431e-06 18.63128 2.014352
1998 0.0002458 8.158431e-06 18.63128 2.014352
1999 0.0002458 8.158431e-06 18.63128 2.014352
2000 0.0002458 8.158431e-06 18.63128 2.014352
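For reference, this is how I'm pulling out the iteration with the lowest mean test RMSE from the CV results (a minimal sketch; the column names are those shown in the evaluation log above):
## Find the iteration with the lowest mean test RMSE in the CV log
eval.log <- as.data.frame(cv$evaluation_log)
best.iter <- which.min(eval.log$test_rmse_mean)
eval.log[best.iter, ]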
Considering how well ranger works, I'm inclined to believe that I'm doing something foolish and causing XGBoost to struggle!
Thanks
Upvotes: 3
Views: 3191
Reputation: 4482
To tune your parameters you can use tuneParams from the mlr package. Here is an example:
library(mlr)

# your_data is your data.frame; target is the name of the response column
task = makeClassifTask(id = "xgb_tuning", data = your_data, target = "response_column_name")
# Define the search space
tuning_options <- makeParamSet(
  makeNumericParam("eta", lower = 0.1, upper = 0.4),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
  makeNumericParam("subsample", lower = 0.5, upper = 1),
  makeNumericParam("min_child_weight", lower = 3, upper = 10),
  makeNumericParam("gamma", lower = 0, upper = 10),
  makeNumericParam("lambda", lower = 0, upper = 5),
  makeNumericParam("alpha", lower = 0, upper = 5),
  makeIntegerParam("max_depth", lower = 1, upper = 10),
  makeIntegerParam("nrounds", lower = 50, upper = 300))
ctrl = makeTuneControlRandom(maxit = 50L)
rdesc = makeResampleDesc("CV", iters = 3L)
learner = makeLearner("classif.xgboost", predict.type = "response")
res = tuneParams(learner = learner, task = task, resampling = rdesc,
                 par.set = tuning_options, control = ctrl, measures = acc)
Of course you can play around with the intervals for your parameters. In the end, res will contain the optimal set of parameters for your xgboost, and you can then train your xgboost using these parameters (see the sketch below). Keep in mind that you can choose other resampling methods apart from cross-validation; try ?makeResampleDesc.
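For example, plugging the tuned values back into the learner could look roughly like this (a minimal sketch; setHyperPars and train are mlr functions, and res$x holds the tuned parameter values):
# Apply the tuned hyperparameters to the learner and train on the task
tuned_learner = setHyperPars(learner, par.vals = res$x)
tuned_model = train(tuned_learner, task)

# Predict (here on the same task, just for illustration)
preds = predict(tuned_model, task = task)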
I hope this helps.
Upvotes: 0