Sean

Reputation: 83

How to tune parameter max_bin in lightgbm?

I ran a basic lightgbm example to test how max_bin affects the model:

require(lightgbm)
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE)
dtest <- lgb.Dataset(data = test$data, label = test$label, free_raw_data = FALSE)

valids <- list(train = dtrain, test = dtest)

set.seed(100)
bst <- lgb.train(data = dtrain,
             num_leaves = 31,
             learning_rate = 0.05,
             nrounds = 20,
             valids = valids,
             nthread = 2,
             max_bin = 32,
             objective = "binary")

I tried setting max_bin to 32 and to 255, and both runs give the same output:

[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
[LightGBM] [Info] Total Bins 128
[LightGBM] [Info] Number of data: 6513, number of used features: 107
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1]:    train's binary_logloss:0.644852 test's binary_logloss:0.644853 
 ......
[20]:   train's binary_logloss:0.204922 test's binary_logloss:0.204929 

Why does max_bin have no effect on the model's training?

Upvotes: 1

Views: 12731

Answers (2)

xingpei Pang

Reputation: 1285

Binning is a technique for representing data in a discrete form (a histogram). LightGBM uses a histogram-based algorithm to find the optimal split point when building each weak learner. Therefore, each continuous numeric feature (e.g. the number of views for a video) is split into discrete bins.

Also, in this GitHub repo, you can find some comprehensive experiments that explain the effect of changing max_bin on CPU and GPU.

If you set max_bin to 255, each feature is bucketed into at most 255 bins, so it can take at most 255 distinct values after binning. A small max_bin makes training faster, while a larger value can improve accuracy.
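To make the binning idea concrete, here is a simplified Python sketch. It is not LightGBM's actual bin-finding code, and the names `bin_upper_bounds` and `to_bin` are made up for illustration. It shows that a feature with fewer distinct values than max_bin gets one bin per value whatever max_bin is, while a feature with many distinct values is capped at max_bin bins. If the agaricus features are 0/1 indicator columns (an assumption on my part), that may be why both runs in the question report the same bins.

```python
from bisect import bisect_left


def bin_upper_bounds(values, max_bin):
    """Sorted upper bounds for at most max_bin bins (simplified sketch)."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bin:
        # Few distinct values: every value gets its own bin,
        # so the max_bin limit is never reached.
        return distinct
    ordered = sorted(values)
    n = len(ordered)
    # Otherwise pick roughly equal-frequency cut points; the last one is the maximum.
    return sorted({ordered[(i * n) // max_bin - 1] for i in range(1, max_bin + 1)})


def to_bin(value, bounds):
    """Index of the first bin whose upper bound is >= value."""
    return bisect_left(bounds, value)


binary_feature = [0, 1, 0, 0, 1, 1, 0, 1]                  # indicator column
many_valued = [(37 * i) % 1000 for i in range(1000)]       # 1000 distinct values

for max_bin in (32, 255):
    print(max_bin,
          len(bin_upper_bounds(binary_feature, max_bin)),
          len(bin_upper_bounds(many_valued, max_bin)))
```

For the indicator column the bin count stays at 2 for both settings, so any split search sees identical histograms; only the many-valued column changes between 32 and 255 bins.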

Upvotes: 2

pplonski

Reputation: 5859

You need to set max_bin when the Dataset is created, because that is when the additional statistics are computed. I don't know the R implementation details, but in Python you pass it to the Dataset as params={"max_bin": 32}.
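As a rough sketch of what that looks like with the Python package: the helper names `make_binned_dataset` and `train_binary` are hypothetical, and the training parameters are simply copied from the question. `lgb.Dataset(..., params=...)` and `lgb.train(params, train_set, num_boost_round=...)` are the calls I am relying on.

```python
def make_binned_dataset(features, labels, max_bin=32):
    """Build a LightGBM Dataset with max_bin given at construction time."""
    # Third-party package; imported here so this file loads even without it.
    import lightgbm as lgb

    # max_bin goes into the Dataset's own params, not only the training params.
    return lgb.Dataset(features, label=labels, params={"max_bin": max_bin})


def train_binary(train_set, num_rounds=20):
    """Train a binary classifier on an already-configured Dataset."""
    import lightgbm as lgb

    params = {"objective": "binary", "num_leaves": 31, "learning_rate": 0.05}
    return lgb.train(params, train_set, num_boost_round=num_rounds)
```

Usage would be something like `booster = train_binary(make_binned_dataset(X, y, max_bin=32))`, where `X` and `y` are your own feature matrix and labels.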

Upvotes: 2
