Richard Haydock

Reputation: 111

XGBoost (R) CV test vs. training error

I'll preface my question by saying that I am, currently, unable to share my data due to extremely strict confidentiality agreements surrounding it. Hopefully I'll be able to get permission to share the blinded data shortly.

I am struggling to get XGBoost trained properly in R. I have been following the guide here and am so far stuck on step 1, tuning the nrounds parameter. The results I'm getting from my cross-validation aren't behaving as I'd expect, leaving me at a loss as to how to proceed.

My data contains 105 observations, a continuous response variable (histogram in the top-left pane of the image linked below) and 16,095 predictor variables. All of the predictors are on the same scale; a histogram of them all is in the top-right pane of the same image. The predictor variables are quite zero-heavy, with 62.82% of all values being 0.

As a separate set of test data I have a further 48 observations. Both data sets have a very similar range in their response variables.

[Image: histograms of the response (top left) and the predictors (top right), plus the CV error curves (bottom)]

So far I've been able to fit a PLS model and a Random Forest (using the R library ranger). Applying these two models to my test data set, I get an RMSE of 19.133 from PLS and 15.312 from ranger. In the case of ranger, with 2000 trees and 760 candidate variables per split, successive model fits are very stable.
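For reference, a minimal sketch of the ranger fit described above, on simulated stand-in data since the real data is confidential (the column layout, data sizes, and predictor distribution here are illustrative assumptions, not from the original post):

```r
library(ranger)

# Simulated stand-in: 105 training rows and 48 test rows, continuous
# response in column 1, zero-heavy predictors on a common scale.
set.seed(1)
n_pred <- 200  # the real data has 16,095 predictors
make_df <- function(n) {
  data.frame(y = rnorm(n, 50, 20),
             matrix(rbinom(n * n_pred, 1, 0.37) * runif(n * n_pred), nrow = n))
}
data.train <- make_df(105)
data.test  <- make_df(48)

# Random forest with the settings quoted above:
# 2000 trees, 760 candidate variables per split (capped for this sketch).
rf <- ranger(y ~ ., data = data.train,
             num.trees = 2000,
             mtry = min(760, n_pred))

pred <- predict(rf, data = data.test)$predictions
rmse <- sqrt(mean((data.test$y - pred)^2))
```

On the real data this reproduces the ranger baseline the XGBoost results are being compared against.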

Returning to XGBoost, using the code below, I have been fixing all parameters except nrounds and using the xgb.cv function in the R package xgboost to calculate the training and test errors.

data.train<-read.csv("../Data/Data_Train.csv")
data.test<-read.csv("../Data/Data_Test.csv")

dtrain <- xgb.DMatrix(data = as.matrix(data.train[,-c(1)]), 
label=data.train[,1])
# dtest <- xgb.DMatrix(data = as.matrix(data.test[,-c(1)]), label=data.test[,1]) # Not used here

## Step 1 - tune number of trees using CV function

  eta <- 0.1; gamma <- 0; max_depth <- 15
  min_child_weight <- 1; subsample <- 0.8; colsample_bytree <- 0.8
  nrounds <- 2000
  cv <- xgb.cv(
    params = list(
      ## General Parameters
      booster = "gbtree", # Default

      ## Tree Booster Parameters
      eta = eta,
      gamma = gamma,
      max_depth = max_depth,
      min_child_weight = min_child_weight,
      subsample = subsample,
      colsample_bytree = colsample_bytree,
      num_parallel_tree = 1, # Default

      ## Regularisation Parameters
      lambda = 1, # Default
      alpha = 0, # Default

      ## Task Parameters
      objective = "reg:linear", # Default ("reg:squarederror" in newer xgboost releases)
      base_score = 0.5, # Default
      # eval_metric defaults to RMSE for this objective
      nthread = 60
    ),
    data = dtrain,
    nrounds = nrounds,
    nfold = 5,
    # stratified sampling applies to classification labels only, so it is not set here
    prediction = TRUE,
    showsd = TRUE,
    # early_stopping_rounds = 20,
    # maximize = FALSE,
    verbose = 1
  )

library(ggplot2)
library(reshape2)
plot.df <- data.frame(NRound = cv$evaluation_log$iter,
                      Train  = cv$evaluation_log$train_rmse_mean,
                      Test   = cv$evaluation_log$test_rmse_mean)
plot.df <- melt(plot.df, measure.vars = 2:3)
ggplot(data = plot.df, aes(x = NRound, y = value, colour = variable)) +
  geom_line() + ylab("Mean RMSE")

If this function does what I believe it does, I would expect the training error to decrease to a plateau and the test error to decrease and then begin to rise again as the model overfits. However, the output I'm getting looks like the code below (and also the lower figure in the link above).

##### xgb.cv 5-folds
    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
       1      94.4494006   1.158343e+00       94.55660      4.811360
       2      85.5397674   1.066793e+00       85.87072      4.993996
       3      77.6640230   1.123486e+00       78.21395      4.966525
       4      70.3846390   1.118935e+00       71.18708      4.759893
       5      63.7045868   9.555162e-01       64.75839      4.668103
---                                                                 
    1996       0.0002458   8.158431e-06       18.63128      2.014352
    1997       0.0002458   8.158431e-06       18.63128      2.014352
    1998       0.0002458   8.158431e-06       18.63128      2.014352
    1999       0.0002458   8.158431e-06       18.63128      2.014352
    2000       0.0002458   8.158431e-06       18.63128      2.014352
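As an illustrative aside (an assumed workflow, not from the original post): the commented-out `early_stopping_rounds` argument in the code above can pick `nrounds` automatically by halting once the held-out RMSE stops improving. A minimal sketch on simulated stand-in data, reusing the question's main tree settings:

```r
library(xgboost)

# Toy stand-in for the confidential data: continuous response, zero-heavy predictors.
set.seed(1)
X <- matrix(rbinom(105 * 200, 1, 0.37) * runif(105 * 200), nrow = 105)
y <- rnorm(105, 50, 20)
dtrain <- xgb.DMatrix(data = X, label = y)

cv <- xgb.cv(
  params = list(objective = "reg:squarederror",  # "reg:linear" on older xgboost releases
                eta = 0.1, max_depth = 15,
                subsample = 0.8, colsample_bytree = 0.8),
  data = dtrain,
  nrounds = 2000,
  nfold = 5,
  early_stopping_rounds = 20,  # stop once test RMSE fails to improve for 20 rounds
  maximize = FALSE,            # RMSE is to be minimised
  verbose = 0
)

cv$best_iteration  # candidate value of nrounds for the final model
```

With early stopping the plateau in the log above would end the run automatically rather than continuing to round 2000.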

Considering how well ranger works, I'm inclined to believe that I'm doing something foolish that is causing XGBoost to struggle!

Thanks

Upvotes: 3

Views: 3191

Answers (1)

quant

Reputation: 4482

To tune your parameters you can use tuneParams from the mlr package. Here is an example:

task = makeRegrTask(id = id, data = "your data", target = "the name of the column in your data of the y variable")

# Define the search space
tuning_options <- makeParamSet(
  makeNumericParam("eta",              lower = 0.1, upper = 0.4),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
  makeNumericParam("subsample",        lower = 0.5, upper = 1),
  makeNumericParam("min_child_weight", lower = 3,   upper = 10),
  makeNumericParam("gamma",            lower = 0,   upper = 10),
  makeNumericParam("lambda",           lower = 0,   upper = 5),
  makeNumericParam("alpha",            lower = 0,   upper = 5),
  makeIntegerParam("max_depth",        lower = 1,   upper = 10),
  makeIntegerParam("nrounds",          lower = 50,  upper = 300))

ctrl = makeTuneControlRandom(maxit = 50L)
rdesc = makeResampleDesc("CV", iters = 3L)
# Your response is continuous, so use the regression learner and RMSE
learner = makeLearner("regr.xgboost", predict.type = "response")

res = tuneParams(learner = learner, task = task, resampling = rdesc,
                 par.set = tuning_options, control = ctrl, measures = rmse)

Of course you can play around with the intervals for your parameters. In the end res will contain the optimal set of parameters for your xgboost, and you can then train your xgboost using these parameters. Keep in mind that you can choose resampling methods other than cross-validation; try ?makeResampleDesc
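As a hedged end-to-end sketch of that last step (the data, search space sizes, and budget below are illustrative stand-ins, not part of the answer): once tuneParams returns, res$x holds the winning parameter values, which can be set on the learner with setHyperPars and trained on the full task:

```r
library(mlr)
library(xgboost)

# Simulated stand-in for the real data.
set.seed(1)
df <- data.frame(y = rnorm(105, 50, 20),
                 matrix(runif(105 * 20), nrow = 105))
task <- makeRegrTask(id = "example", data = df, target = "y")

# Deliberately small search space and budget so the sketch runs quickly.
tuning_options <- makeParamSet(
  makeNumericParam("eta",     lower = 0.1, upper = 0.4),
  makeIntegerParam("nrounds", lower = 10,  upper = 50))

ctrl    <- makeTuneControlRandom(maxit = 5L)
rdesc   <- makeResampleDesc("CV", iters = 3L)
learner <- makeLearner("regr.xgboost", predict.type = "response")

res <- tuneParams(learner = learner, task = task, resampling = rdesc,
                  par.set = tuning_options, control = ctrl, measures = rmse)

# res$x is a named list of the best hyper-parameters found;
# fix them on the learner and train the final model on all the data.
final_learner <- setHyperPars(learner, par.vals = res$x)
final_model   <- train(final_learner, task)
```

The same pattern applies unchanged with the full search space from the answer, just with a longer tuning budget.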

I hope it helps

Upvotes: 0
