Alex

Reputation: 21

R Partial Dependence Plots for XGBoost

I've fit an XGBoost model on a sparse matrix and am trying to display some partial dependence plots. I've been using the pdp package but am open to suggestions. The code below is a reproducible example of what I'm trying to do.

# load required packages
require(Matrix)  # package names are case sensitive; provides sparse.model.matrix()
require(xgboost)
require(pdp)

# dummy data
categorical <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B')
numerical <- c(1, 2, 3, 4, 1, 2, 3, 4)
target <- c(100, 200, 300, 400, 500, 600, 700, 800)
data <- data.frame(categorical, numerical, target)

# create sparse matrix and run xgb
data.sparse <- sparse.model.matrix(target ~ . - 1, data)
data.xgb <- xgboost(data=data.sparse, label=data$target, nrounds=100)

# attempt to create partial dependence plots
partial(data.xgb, pred.var="numerical", plot=TRUE, rug=TRUE, train=data, type="regression")
partial(data.xgb, pred.var="categorical", plot=TRUE, rug=TRUE, train=data, type="regression")
partial(data.xgb, pred.var="categoricalA", plot=TRUE, rug=TRUE, train=data.sparse, type="regression")
partial(data.xgb, pred.var="categoricalB", plot=TRUE, rug=TRUE, train=data.sparse, type="regression")

# confirm the model is making sensible predictions despite pdp looking odd
chk <- data[2,]
chk.sparse <- sparse.model.matrix(target ~ . - 1, chk)
chk.pred <- predict(data.xgb, chk.sparse)
print(chk.pred) # gives expected values e.g. 199.9992 for second row

Questions

  1. How can I display a PDP for the categorical variable so that I see A and B on one chart, rather than a separate line for categoricalA?
  2. Why, in this example, does the model predict the correct values while the PDP for the numerical variable is flat?
  3. I'd love for someone to post some code demonstrating how cross validation and/or grid search could be implemented in the example above (assuming the data was bigger).

Many thanks

Upvotes: 1

Views: 2322

Answers (1)

see24

Reputation: 1220

It appears you will have to get the data out of partial by setting plot = FALSE and build your own plot. I recommend geom_crossbar for categorical variables. I looked at the source of the partial function in pdp on GitHub: there is a cats argument where you are supposed to name the categorical variables, but as far as I can see it is not used anywhere in the function.

For cross validation and grid search, use caret. This is a great resource to learn how.
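To make the "build your own plot" idea concrete, here is a rough sketch for the categorical variable. Note that it does not call partial at all: because the factor is one-hot encoded into categoricalA and categoricalB, it computes the partial dependence by hand (force every row to one level, predict, average) and then draws it with geom_crossbar. Several things here are my own assumptions rather than anything from the question: the model is fit with xgb.train on a dense copy of the matrix, the names dummy.cols, pd and p.cat are just illustrative, and the crossbar box spans the minimum to maximum of the individual predictions.

```r
# Sketch: partial dependence computed by hand for a one-hot encoded factor.
library(Matrix)
library(xgboost)
library(ggplot2)

# same dummy data as in the question
data <- data.frame(
  categorical = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  numerical   = c(1, 2, 3, 4, 1, 2, 3, 4),
  target      = c(100, 200, 300, 400, 500, 600, 700, 800)
)

# dense copy of the design matrix (assumption: small enough to densify)
X <- as.matrix(sparse.model.matrix(target ~ . - 1, data))

# assumption: xgb.train + xgb.DMatrix instead of the xgboost() wrapper
bst <- xgb.train(
  params  = list(objective = "reg:squarederror"),
  data    = xgb.DMatrix(X, label = data$target),
  nrounds = 100,
  verbose = 0
)

# for each level: set every row to that level, predict, and summarise
dummy.cols <- paste0("categorical", levels(data$categorical))
pd <- do.call(rbind, lapply(levels(data$categorical), function(lev) {
  X.mod <- X
  X.mod[, dummy.cols] <- 0
  X.mod[, paste0("categorical", lev)] <- 1
  p <- predict(bst, newdata = X.mod)
  data.frame(categorical = lev, yhat = mean(p), lower = min(p), upper = max(p))
}))

# one chart with both levels on the x axis
p.cat <- ggplot(pd, aes(x = categorical, y = yhat, ymin = lower, ymax = upper)) +
  geom_crossbar(width = 0.4) +
  labs(y = "average prediction")
print(p.cat)
```

The same loop works for the numerical column: replace the level loop with a loop over a grid of values and overwrite X.mod[, "numerical"] instead of the dummy columns.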
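For the cross validation and grid search part, caret's train() would wrap the whole search for you. As a dependency-light alternative (a swap on my part, not the caret route), the sketch below loops over a small grid and scores each setting with xgboost's own xgb.cv. The simulated data, the grid values and the variable names are all hypothetical and only there so the example has enough rows to split into folds.

```r
# Sketch: 5-fold cross-validation over a small hyperparameter grid with xgb.cv.
library(xgboost)

# simulated data (hypothetical; loosely mimics the structure in the question)
set.seed(42)
n <- 400
categoricalB <- as.numeric(rbinom(n, 1, 0.5))
numerical    <- as.numeric(sample(1:4, n, replace = TRUE))
X <- cbind(categoricalB = categoricalB, numerical = numerical)
y <- 100 * numerical + 400 * categoricalB + rnorm(n, sd = 10)
dtrain <- xgb.DMatrix(X, label = y)

# candidate settings; the values are arbitrary illustrations
grid <- expand.grid(max_depth = c(2, 4), eta = c(0.1, 0.3))
grid$best_nrounds <- NA_integer_
grid$cv_rmse <- NA_real_

for (i in seq_len(nrow(grid))) {
  cv <- xgb.cv(
    params  = list(objective = "reg:squarederror",
                   max_depth = grid$max_depth[i],
                   eta       = grid$eta[i]),
    data    = dtrain,
    nrounds = 100,
    nfold   = 5,
    verbose = FALSE
  )
  # per-round log of the fold-averaged train and test RMSE
  cv.log <- as.data.frame(cv$evaluation_log)
  best <- which.min(cv.log$test_rmse_mean)
  grid$best_nrounds[i] <- best
  grid$cv_rmse[i] <- cv.log$test_rmse_mean[best]
}

# settings ordered from best to worst cross-validated RMSE
print(grid[order(grid$cv_rmse), ])
```

The top row of the printed grid gives the combination (and number of rounds) you would then use to refit on the full training set.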

Upvotes: 0
