maciek

Reputation: 541

xgboost prediction in R is different on sparse and dense matrices

I've trained a simple model using the xgboost library in R on a matrix produced by sparse.model.matrix, then predicted on two validation datasets: one created by sparse.model.matrix from Matrix and the other by model.matrix from stats. To my great surprise, the results differ significantly. The sparse and dense matrices have identical dimensions, all data is numeric, and there are no missing values.

Mean prediction on these two sets is the following:

Is it a feature or a bug?

Update:

I've noticed that the discrepancy does not occur when the values are all positive or all negative. If x1 is instead defined as x1=sample(1:7, 2000, replace=T), the mean prediction is the same in both cases.

Code in R:

library(Matrix)
library(xgboost)

# Simulated data: binary target, an integer feature that includes zeros, and a uniform feature
valid <- data.frame(y=sample(0:1, 2000, replace=TRUE), x1=sample(-1:5, 2000, replace=TRUE), x2=runif(2000))
train <- data.frame(y=sample(0:1, 10000, replace=TRUE), x1=sample(-1:5, 10000, replace=TRUE), x2=runif(10000))

# Training set as a sparse design matrix
sparse_train_matrix <- sparse.model.matrix(~ ., data=train[, c("x1", "x2")])
d_sparse_train_matrix <- xgb.DMatrix(sparse_train_matrix, label = train$y)

# Validation set as a sparse design matrix
sparse_valid_matrix <- sparse.model.matrix(~ ., data=valid[, c("x1", "x2")])
d_sparse_valid_matrix <- xgb.DMatrix(sparse_valid_matrix, label = valid$y)

# The same validation set as a dense design matrix
valid_matrix <- model.matrix(~ ., data=valid[, c("x1", "x2")])
d_valid_matrix <- xgb.DMatrix(valid_matrix, label = valid$y)

params <- list(objective = "binary:logistic", seed = 99, eval_metric = "auc")

sparse_w <- list(train=d_sparse_train_matrix, test=d_sparse_valid_matrix)
set.seed(1)
sparse_fit_xgb <- xgb.train(data=d_sparse_train_matrix, watchlist=sparse_w, params=params, nrounds=100)

# Predict with the same model on the dense (p1) and sparse (p2) validation sets
p1 <- predict(sparse_fit_xgb, newdata=d_valid_matrix, type="response")
p2 <- predict(sparse_fit_xgb, newdata=d_sparse_valid_matrix, type="response")

mean(p1); mean(p2)

My sessionInfo:

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=Polish_Poland.1250

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] xgboost_0.6-4 Matrix_1.2-10 data.table_1.10.4 dplyr_0.7.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11 lattice_0.20-35 assertthat_0.2.0 grid_3.4.1
 [5] R6_2.2.2 magrittr_1.5 stringi_1.1.5 rlang_0.1.1
 [9] bindrcpp_0.2 tools_3.4.1 glue_1.1.1 compiler_3.4.1
[13] pkgconfig_2.0.1 bindr_0.1 tibble_1.3.3

Upvotes: 5

Views: 2276

Answers (1)

maciek

Reputation: 541

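I've found here and here that this is expected behaviour, and it makes sense to me. As far as I understand, xgboost treats the zeros that a sparse matrix does not store as missing values, while zeros in a dense matrix are treated as actual values. That would also explain why the predictions agree when x1 contains no zeros.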
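A minimal sketch of how one could check this, assuming the difference really comes from how zeros are interpreted. It builds the dense validation DMatrix a second time with missing = 0, so that zeros are declared missing just as unstored entries in the sparse matrix are; those predictions should then agree with the sparse ones. The helper make_data and the shortened nrounds = 20 are my own choices for illustration and are not part of the original code.

```r
library(Matrix)
library(xgboost)

set.seed(1)

# Hypothetical helper (not in the original post): same data layout as the question
make_data <- function(n) {
  data.frame(y  = sample(0:1, n, replace = TRUE),
             x1 = sample(-1:5, n, replace = TRUE),  # contains zeros
             x2 = runif(n))
}
train <- make_data(10000)
valid <- make_data(2000)

sparse_train <- sparse.model.matrix(~ ., data = train[, c("x1", "x2")])
sparse_valid <- sparse.model.matrix(~ ., data = valid[, c("x1", "x2")])
dense_valid  <- model.matrix(~ ., data = valid[, c("x1", "x2")])

# Train on the sparse matrix, as in the question
dtrain <- xgb.DMatrix(sparse_train, label = train$y)
fit <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 20)

# Sparse input: unstored zeros are treated as missing
p_sparse <- predict(fit, xgb.DMatrix(sparse_valid))

# Dense input with the default missing = NA: zeros are real values
p_dense <- predict(fit, xgb.DMatrix(dense_valid))

# Dense input with zeros explicitly declared as missing
p_dense_missing0 <- predict(fit, xgb.DMatrix(dense_valid, missing = 0))

# Largest absolute differences against the sparse predictions
max(abs(p_sparse - p_dense))
max(abs(p_sparse - p_dense_missing0))
```

In practice the simpler rule is probably to use the same representation (sparse or dense) for training and prediction, so that zeros mean the same thing in both places.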

Upvotes: 4
