Reputation: 394
I've been working with LightGBM in R, using the official package, for a few weeks now. Recently I was reviewing the API in their guide and came across the function lgb.Dataset.construct, which I hadn't been using. But the stuff I've been doing seems to work fine, so maybe this call is unnecessary? The man page (https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.Dataset.construct.html) isn't particularly useful, simply saying it 'Construct[s] Dataset explicitly' and giving the example:
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
lgb.Dataset.construct(dtrain)
My code looks much like this, but without the last line, and, like I said, it has been working fine (as far as I can tell).
Upvotes: 0
Views: 1304
Reputation: 2670
LightGBM does some one-time preprocessing, like bucketing continuous features into histograms and dropping unsplittable features, before training. See this answer for a more detailed description of that.
Creating a Dataset object in the R package tells LightGBM where to find the raw (unprocessed) data and what parameters you want to use when doing that preprocessing, but it doesn't actually do that work.
That preprocessing work only actually happens once the Dataset is "constructed".
But the stuff I've been doing seems to work fine, so maybe this call is unnecessary
If you use lightgbm::lgb.cv(), lightgbm::lightgbm(), or lightgbm::lgb.train(), you don't need to call lgb.Dataset.construct() beforehand. It will be called by {lightgbm} inside those functions.
For example, you can run the following code to see that training with lgb.train() doesn't require explicitly calling lgb.Dataset.construct().
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.train(
    params = list(
        objective = "regression"
        , metric = "l2"
    )
    , data = dtrain
    , nrounds = 5L
)
Given that background, to the question in the title.
what is the purpose of lgb.Dataset.construct?
This function can be used to run Dataset construction outside of the training process. In most cases you don't need to do that, but it might be useful if you want to, for example, measure how long the training process takes and you want to remove Dataset construction from that timing.
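As a rough sketch of that timing use case, the snippet below times construction and training separately with base R's system.time(). It reuses the agaricus.train data and the parameters from the example above; the variable names construct_time and train_time are just illustrative, and the actual timings will depend on your machine and data.

```r
library(lightgbm)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train

# Creating the Dataset only describes the raw data; no preprocessing happens yet.
dtrain <- lgb.Dataset(train$data, label = train$label)

# Time the one-time preprocessing on its own.
# The Dataset object is constructed in place.
construct_time <- system.time(
    lgb.Dataset.construct(dtrain)
)

# dtrain is already constructed, so this timing should not include
# Dataset construction.
train_time <- system.time(
    model <- lgb.train(
        params = list(
            objective = "regression"
            , metric = "l2"
        )
        , data = dtrain
        , nrounds = 5L
    )
)

print(construct_time[["elapsed"]])
print(train_time[["elapsed"]])
```

On a dataset this small both numbers will be tiny, so the split mostly matters for larger data where construction can take a noticeable share of the total time.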
Upvotes: 1