Reputation: 21
I've trained an XGBoost model on a sparse matrix and am trying to display some partial dependence plots. I've been using the pdp package but am open to suggestions. The code below is a reproducible example of what I'm trying to do.
# load required packages
require(Matrix)
require(xgboost)
require(pdp)
# dummy data
categorical <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B')
numerical <- c(1, 2, 3, 4, 1, 2, 3, 4)
target <- c(100, 200, 300, 400, 500, 600, 700, 800)
data <- data.frame(categorical, numerical, target)
# create sparse matrix and run xgb
data.sparse <- sparse.model.matrix(target ~ . - 1, data)
data.xgb <- xgboost(data=data.sparse, label=data$target, nrounds=100)
# attempt to create partial dependence plots
partial(data.xgb, pred.var="numerical", plot=TRUE, rug=TRUE, train=data, type="regression")
partial(data.xgb, pred.var="categorical", plot=TRUE, rug=TRUE, train=data, type="regression")
partial(data.xgb, pred.var="categoricalA", plot=TRUE, rug=TRUE, train=data.sparse, type="regression")
partial(data.xgb, pred.var="categoricalB", plot=TRUE, rug=TRUE, train=data.sparse, type="regression")
# confirm the model is making sensible predictions despite pdp looking odd
chk <- data[2,]
chk.sparse <- sparse.model.matrix(target ~ . - 1, chk)
chk.pred <- predict(data.xgb, chk.sparse)
print(chk.pred) # gives expected values e.g. 199.9992 for second row
Questions
Many thanks
Upvotes: 1
Views: 2322
Reputation: 1220
It appears you will have to set plot to FALSE so that partial returns the data instead of drawing it, and then create your own plot. I recommend geom_crossbar for categorical variables. I looked at the code for the partial function in pdp on GitHub: there is a cats argument where you are supposed to name the categorical variables, but as far as I can see it is not used anywhere in the function. For cross-validation and grid search, use caret. This is a great resource to learn how.
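A minimal sketch of the plot = FALSE approach, reusing the dummy data from the question. It assumes an xgboost version that still accepts the data/label interface used there, and that partial accepts the sparse matrix as train, as in the question's own calls. The `level` column, the 0/1-to-A/B relabelling, and the axis labels are my additions for illustration, not part of pdp.

```r
library(Matrix)
library(xgboost)
library(pdp)
library(ggplot2)

# Dummy data from the question; categorical is made an explicit factor
data <- data.frame(
  categorical = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  numerical   = c(1, 2, 3, 4, 1, 2, 3, 4),
  target      = c(100, 200, 300, 400, 500, 600, 700, 800)
)

# One-hot encode into a sparse matrix and fit the model as in the question
data.sparse <- sparse.model.matrix(target ~ . - 1, data)
data.xgb <- xgboost(data = data.sparse, label = data$target,
                    nrounds = 100, verbose = 0)

# plot = FALSE returns a data frame with the predictor column and yhat
pd <- partial(data.xgb, pred.var = "categoricalB", plot = FALSE,
              train = data.sparse, type = "regression")

# Assumption: the dummy column is 0 for level A and 1 for level B
pd$level <- factor(pd$categoricalB, levels = c(0, 1), labels = c("A", "B"))

# ymin = ymax = yhat makes each crossbar a single horizontal bar per level
p <- ggplot(pd, aes(x = level, y = yhat)) +
  geom_crossbar(aes(ymin = yhat, ymax = yhat), width = 0.5) +
  labs(x = "categorical", y = "Partial dependence (yhat)")
print(p)
```

Note that partial only varies the column named in pred.var, so categoricalA keeps its observed values while categoricalB is set to 0 or 1. With one-hot columns this produces some rows that could not occur in the real data, which may be part of why the default plots look odd.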
Upvotes: 0