user113156

Reputation: 7107

xgboost raises an error inside a for loop but works when each iteration is run independently

I am running into an error when I train xgboost inside a for loop. The error is the following:

Error in xgb.iter.eval(bst$handle, watchlist, iteration - 1, feval) : 
  [23:48:27] amalgamation/../src/metric/rank_metric.cc:135: Check failed: !auc_error AUC: the dataset only contains pos or neg samples

Somebody else asked a similar question here.

The creator of the package suggested the following:

This means some of your training data or evaluation data contains all 1 or all 0 as label

My problem is a binary classification problem, with labels 0 and 1.

My code is as follows:

all <- NULL
for(i in 1:length(splitxgb)){
    xgbdata <- splitxgb[[i]]
    smp_size <- floor(0.75 * nrow(xgbdata))
    train_ind <- sample(seq_len(nrow(xgbdata)), size = smp_size)
    train <- xgbdata[train_ind, ]
    test <- xgbdata[-train_ind, ]
    ids <- sample(nrow(train))
    nfolds <- 5 #TAKE this out of the forloop
    score <- data.table()
    result <- data.table()

    x_train <- train %>%
      select(-BvD.ID.number, -Major.sectors, -Region.in.country, -Major.sectors.id, -Region.in.country.id, -status)
    x_test <- test %>%
      select(-BvD.ID.number, -Major.sectors, -Region.in.country, -Major.sectors.id, -Region.in.country.id, -status)
    y_train <- train$status
    y_test <- test$status

    nrounds <- 12 #take out of the for loop
    early_stopping_round <- NULL # take out of the for loop
    dtrain <- xgb.DMatrix(data = as.matrix(x_train), label = y_train, missing=NaN)
    dtest <- xgb.DMatrix(data = as.matrix(x_test), missing=NaN)
    watchlist <- list(train = dtrain)

    params <- list("eta" = 0.01,
                   "max_depth" = 10,       # take out of the for loop
                   "colsample_bytree" = 0.50,
                   "min_child_weight" = 0.75,
                   "subsample" = 0.5,
                   "objective" = "reg:logistic", #should this be reg_log, binary:log etc.
                   "eval_metric" = "auc")

    model_xgb <- xgb.train(params = params,
                           data = dtrain,
                           maximize = TRUE,
                           nrounds = nrounds,
                           watchlist = watchlist,
                           early_stopping_rounds = early_stopping_round,
                           print_every_n = 1)

    pred <- predict(model_xgb, dtest)
    result <- cbind(test %>%
                      select(BvD.ID.number), status = round(pred, 0), pred)

    compare <- merge(x = result, y = test[ , c("BvD.ID.number", "status", "Region.in.country", "Major.sectors")], by = "BvD.ID.number", all.x=TRUE)
    all[[i]] <- compare

}

This gives me the error above. However, when I take the body out of the for loop and run it for a single i, for example as follows:

i <-165

xgbdata <- splitxgb[[i]]
smp_size <- floor(0.75 * nrow(xgbdata))
train_ind <- sample(seq_len(nrow(xgbdata)), size = smp_size)
train <- xgbdata[train_ind, ]
test <- xgbdata[-train_ind, ]
ids <- sample(nrow(train))
nfolds <- 5 #TAKE this out of the forloop
score <- data.table()
result <- data.table()

x_train <- train %>%
  select(-BvD.ID.number, -Major.sectors, -Region.in.country, -Major.sectors.id, -Region.in.country.id, -status)
x_test <- test %>%
  select(-BvD.ID.number, -Major.sectors, -Region.in.country, -Major.sectors.id, -Region.in.country.id, -status)
y_train <- train$status
y_test <- test$status

nrounds <- 12 #take out of the for loop
early_stopping_round <- NULL # take out of the for loop
dtrain <- xgb.DMatrix(data = as.matrix(x_train), label = y_train, missing=NaN)
dtest <- xgb.DMatrix(data = as.matrix(x_test), missing=NaN)
watchlist <- list(train = dtrain)

params <- list("eta" = 0.01,
               "max_depth" = 10,       # take out of the for loop
               "colsample_bytree" = 0.50,
               "min_child_weight" = 0.75,
               "subsample" = 0.5,
               "objective" = "reg:logistic", #should this be reg_log, binary:log etc.
               "eval_metric" = "auc")

model_xgb <- xgb.train(params = params,
                       data = dtrain,
                       maximize = TRUE,
                       nrounds = nrounds,
                       watchlist = watchlist,
                       early_stopping_rounds = early_stopping_round,
                       print_every_n = 1)

pred <- predict(model_xgb, dtest)
result <- cbind(test %>%
                  select(BvD.ID.number), status = round(pred, 0), pred)

compare <- merge(x = result, y = test[ , c("BvD.ID.number", "status", "Region.in.country", "Major.sectors")], by = "BvD.ID.number", all.x=TRUE)
all[[i]] <- compare

and run this for each i separately, I obtain no errors.

There is some information online, but nothing specific to the problem I am running into. Why am I getting errors in the loop but not when running each iteration individually?

Upvotes: 0

Views: 722

Answers (1)

Eran Moshe

Reputation: 3208

It looks like your random split sometimes produces a train or test set in which all the labels are 1 or all the labels are 0.

Try printing (or writing to CSV) each of your splits and check whether that is the case.
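A minimal sketch of such a check, in base R. The toy `splitxgb` below is only a stand-in for the list in the question (assumed to be a list of data frames with a 0/1 `status` column); `label_counts` and `risky` are names made up for illustration.

```r
# Toy stand-in for the question's `splitxgb`: a list of data frames,
# each with a 0/1 `status` column (assumption, not the real data).
set.seed(1)
splitxgb <- list(
  a = data.frame(x = rnorm(8), status = c(0, 1, 0, 1, 0, 1, 0, 1)),
  b = data.frame(x = rnorm(8), status = c(0, 0, 0, 0, 0, 0, 0, 1)),
  c = data.frame(x = rnorm(8), status = rep(1, 8))
)

# Count the rows per label in every group. factor() with fixed levels
# keeps a zero count when a label is missing from a group.
label_counts <- t(sapply(splitxgb, function(d) {
  table(factor(d$status, levels = c(0, 1)))
}))
print(label_counts)

# Groups where the rarer label has fewer than 2 rows: a random 75/25 split
# of such a group cannot put that label in both train and test.
risky <- names(which(apply(label_counts, 1, min) < 2))
print(risky)
```

Because the split is random, a group with very few rows of one label can fail on one run and pass on the next, which would be consistent with the loop failing while a manual rerun of a single `i` succeeds.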

If so, you want to make sure there's at least 1 row of data of each label for every split (for train and for test).

You can do this by repeating the split until that condition holds, or by enforcing it in your code in any other way you choose.
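One way to enforce the condition without retrying is to sample within each label separately (a stratified split). The helper below is a hypothetical sketch in base R, not part of xgboost; the name `stratified_split` and its arguments are assumptions, and `status` is taken to be the label column as in the question.

```r
# Hypothetical helper: split each label's rows separately so that every
# label present in the data ends up in the training set.
stratified_split <- function(data, label_col = "status", train_frac = 0.75) {
  # Row indices grouped by label value.
  idx_by_label <- split(seq_len(nrow(data)), data[[label_col]])
  train_ind <- unlist(lapply(idx_by_label, function(idx) {
    # At least one row per label goes to train; when a label has two or
    # more rows, at least one row is left over for test.
    n_train <- min(max(1, floor(train_frac * length(idx))),
                   max(1, length(idx) - 1))
    # sample.int() avoids the sample(x) surprise when length(x) == 1.
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train = data[train_ind, , drop = FALSE],
       test  = data[-train_ind, , drop = FALSE])
}

# Usage on toy data with 6 zeros and 4 ones.
set.seed(42)
toy <- data.frame(x = 1:10, status = c(rep(0, 6), rep(1, 4)))
parts <- stratified_split(toy)
table(parts$train$status)
table(parts$test$status)
```

A label with a single row still lands only in train, and a group that contains just one label stays single-class, so such groups would still need to be skipped or handled separately.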

I would suggest re-sampling until that condition holds.
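A rough sketch of that re-sampling approach in base R follows. The function name `split_with_both_classes`, its arguments, and the retry cap `max_tries` are all assumptions made for illustration; the 75/25 split and the `status` label column mirror the code in the question.

```r
# Hypothetical helper: repeat the random split until both train and test
# contain both labels, giving up after `max_tries` attempts.
split_with_both_classes <- function(data, label_col = "status",
                                    train_frac = 0.75, max_tries = 100) {
  has_both <- function(y) length(unique(y)) == 2
  n <- nrow(data)
  for (attempt in seq_len(max_tries)) {
    train_ind <- sample(seq_len(n), size = floor(train_frac * n))
    train <- data[train_ind, , drop = FALSE]
    test  <- data[-train_ind, , drop = FALSE]
    if (has_both(train[[label_col]]) && has_both(test[[label_col]])) {
      return(list(train = train, test = test, attempts = attempt))
    }
  }
  NULL  # no valid split found, e.g. the group has only one label
}

# Usage on toy data: one group with both labels, one with a single label.
set.seed(42)
mixed <- data.frame(x = 1:10, status = c(rep(0, 6), rep(1, 4)))
ones  <- data.frame(x = 1:8, status = rep(1, 8))
res_mixed <- split_with_both_classes(mixed)
res_ones  <- split_with_both_classes(ones)
```

Inside the loop from the question, a `NULL` result could be used to skip that group (for example with `next`) instead of passing a single-class training set to `xgb.train`. The cap on attempts is there because a group with fewer than two rows of either label can never satisfy the condition.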

Upvotes: 1
