Reputation: 43
I am a little confused about these data structure APIs in LightGBM. My understanding is like:
Could someone clarify?
Upvotes: 4
Views: 2938
Reputation: 2670
LightGBM model training begins by transforming raw data into a Dataset. To build up such an object incrementally, you ask LightGBM to iterate over chunks of data provided by a Sequence.
Training produces a model object called a Booster. This object can be saved in text or binary form, and its predict() method can be used to create predictions on new data.
LightGBM's R and Python packages contain a function lgb.cv() which performs k-fold cross-validation. This function produces a CVBooster, an object which contains a list of Booster objects (one per fold).
Dataset
Before training, LightGBM does some one-time preprocessing, like bucketing continuous features into histograms and dropping unsplittable features. See this answer for a more detailed description of that.
The Dataset class manages that preprocessing. In the lightgbm Python package, for example, you can use lgb.Dataset() to create one of these objects from a numpy array, scipy sparse array, pandas DataFrame, or CSV/TSV file.
lgb.train() in the Python package expects to be passed one of these objects. Classes in the scikit-learn API expect to be passed raw data like numpy arrays and create the Dataset for you internally.
Booster
The "B" in "LightGBM" stands for "Boosting". The Booster class is the core model object for LightGBM. It holds the current state of the model and has methods for doing things like continuing the training process (.update()), creating predictions on new data (.predict()), and more.
When training a model with the Python package, you can use lgb.train() to produce one of these Booster objects. If you use the scikit-learn API, the resulting model object will have a Booster inside it.
CVBooster
The CVBooster class does not exist in the core LightGBM C/C++ library. It is specific to wrapper packages like the R and Python packages.
"CV" in CVBooster stands for "cross-validation". It's the object produced by the cross-validation function lgb.cv() in the LightGBM Python and R packages.
The CVBooster contains an attribute .boosters, which is a list of Booster objects (one for each fold from k-fold cross-validation).
Sequence
The Sequence object was introduced in LightGBM 3.3.0 and, as of that release, was supported only in the Python package.
The Sequence object is a Python class which allows LightGBM to iterate over chunks of raw data to build up a Dataset object incrementally.
This object should be used when your raw data are already partitioned into files (e.g. a collection of HDF5 files) or when you are concerned about exceeding the available memory during Dataset construction.
Upvotes: 4