Reputation: 83
I ran a basic LightGBM example to test how max_bin affects the model:
require(lightgbm)
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test
dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE)
dtest <- lgb.Dataset(data = test$data, label = test$label, free_raw_data = FALSE)
valids <- list(train = dtrain, test = dtest)
set.seed(100)
bst <- lgb.train(data = dtrain,
                 num_leaves = 31,
                 learning_rate = 0.05,
                 nrounds = 20,
                 valids = valids,
                 nthread = 2,
                 max_bin = 32,
                 objective = "binary")
I tried setting max_bin to 32 and to 255; both runs produce the same output:
[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
[LightGBM] [Info] Total Bins 128
[LightGBM] [Info] Number of data: 6513, number of used features: 107
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1]: train's binary_logloss:0.644852 test's binary_logloss:0.644853
......
[20]: train's binary_logloss:0.204922 test's binary_logloss:0.204929
Why does max_bin have no effect on the model's training?
Upvotes: 1
Views: 12731
Reputation: 1285
Binning is a technique for representing data in a discrete view (a histogram). LightGBM uses a histogram-based algorithm to find the optimal split point when building a weak learner, so each continuous numeric feature (e.g. the number of views of a video) is split into discrete bins.
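To make the idea concrete, here is a rough sketch in R of what binning does. This is only an illustration of the concept, not LightGBM's actual binning code: the feature name views, the quantile-based breaks, and the use of cut() are all assumptions chosen for the example.

set.seed(1)
views <- rexp(1000, rate = 1e-4)   # hypothetical continuous feature ("number of views")
n_bins <- 32                       # plays the role of max_bin in this sketch
breaks <- unique(quantile(views, probs = seq(0, 1, length.out = n_bins + 1)))
binned <- cut(views, breaks = breaks, include.lowest = TRUE)
nlevels(binned)                    # at most n_bins discrete values remain

After binning, the split search only has to consider the bin boundaries instead of every unique value of the feature, which is what makes the histogram algorithm fast.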
Also, in this GitHub repo you can find comprehensive experiments that explain the effect of changing max_bin on CPU and GPU.
If you set max_bin = 255, each feature is bucketed into at most 255 bins. A small max_bin gives faster training, while a larger value can improve accuracy.
Upvotes: 2
Reputation: 5859
You need to set max_bin during Dataset creation. When the Dataset is created, the additional statistics are computed. I don't know the R implementation details, but in Python you pass it as params={"max_bin": 32}.
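In R, the analogous fix would look roughly like this. This is a sketch assuming the R package's lgb.Dataset() accepts a params list (the name dtrain_32 is just illustrative); the "Total Bins" line in the training log shows whether the new value took effect.

# Pass max_bin when constructing the Dataset, not to lgb.train()
dtrain_32 <- lgb.Dataset(data = train$data, label = train$label,
                         params = list(max_bin = 32))
bst <- lgb.train(params = list(objective = "binary",
                               num_leaves = 31,
                               learning_rate = 0.05),
                 data = dtrain_32,
                 nrounds = 20)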
Upvotes: 2