Nayr Nauy

Reputation: 43

What is the difference between Dataset, Booster, CVBooster, and Sequence in lightgbm?

I am a little confused about these data structure APIs in lightgbm. My understanding is:

  1. lgb.Dataset is commonly used to load data from pandas etc.
  2. lgb.Booster is for loading a model you trained before? (not sure)
  3. lgb.CVBooster is the same as lgb.Booster except for some details? (not sure)
  4. lgb.Sequence is a 1-dimensional dataset like those used in DNNs? (not sure)

Could someone clarify?

Upvotes: 4

Views: 2938

Answers (1)

James Lamb

Reputation: 2670

Summary

LightGBM model training begins by transforming raw data into a Dataset. To build up such an object incrementally, you ask LightGBM to iterate over chunks of data provided by a Sequence.

Training produces a model object called a Booster. This object can be saved in text or binary form, and its predict() method can be used to create predictions on new data.

LightGBM's R and Python packages contain a function lgb.cv() which performs k-fold cross-validation. This function produces a CVBooster, an object which contains a list of Booster objects (one per fold).

More Details

Dataset

Before training, LightGBM does some one-time preprocessing, like bucketing continuous features into histograms and dropping unsplittable features. See this answer for a more detailed description of that.

The Dataset class manages that preprocessing. In the lightgbm Python package, for example, you can use lgb.Dataset() to create one of these objects from a numpy array, scipy sparse matrix, pandas DataFrame, or CSV/TSV file.

lgb.train() in the Python package expects to be passed one of these objects. Classes in the scikit-learn API expect to be passed raw data like numpy arrays and create the Dataset for you internally.
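
For concreteness, here is a minimal sketch (the array shapes and parameter values are illustrative, not recommendations) of building a Dataset from in-memory numpy arrays and handing it to lgb.train():

```python
import numpy as np
import lightgbm as lgb

# illustrative random regression data
X = np.random.rand(500, 10)
y = np.random.rand(500)

# Dataset wraps the raw data and performs LightGBM's one-time preprocessing
train_set = lgb.Dataset(X, label=y)

# lgb.train() expects a Dataset, not raw arrays
booster = lgb.train({"objective": "regression"}, train_set, num_boost_round=10)
```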

Booster

The "B" in "LightGBM" stands for "Boosting". The Booster class is the core model object for LightGBM. It holds the current state of the model and has methods for doing things like continuing the training process (.update()), creating predictions on new data (.predict()), and more.

When training a model with the Python package, you can use lgb.train() to produce one of these Booster objects. If you use the scikit-learn API, the resulting model object will have a Booster inside it.
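
Continuing the sketch above (the file name is arbitrary), the returned Booster can create predictions and be saved and reloaded:

```python
# predictions on new data (here reusing X from the previous sketch)
preds = booster.predict(X)

# save in LightGBM's text format, then reload into a fresh Booster
booster.save_model("model.txt")
booster_from_file = lgb.Booster(model_file="model.txt")
```

With the scikit-learn interface, the Booster inside a fitted estimator is reachable through its booster_ attribute.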

CVBooster

The CVBooster class does not exist in the core LightGBM C/C++ library. It is specific to wrapper packages like the R and Python packages.

"CV" in CVBooster stands for "cross-validation". It's the object produced by the cross-validation function lgb.cv() in the LightGBM Python and R packages.

The CVBooster contains an attribute .boosters, which is a list of Booster objects (one for each fold from k-fold cross validation).
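
As a rough sketch (reusing the Dataset from the earlier example, and assuming a lightgbm version that supports the return_cvbooster flag), the CVBooster can be pulled out of the dictionary returned by lgb.cv():

```python
cv_results = lgb.cv(
    {"objective": "regression"},
    train_set,
    num_boost_round=10,
    nfold=5,
    return_cvbooster=True,
)

cv_booster = cv_results["cvbooster"]
print(len(cv_booster.boosters))  # one Booster per fold, so 5 here
```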

Sequence

The Sequence object was new as of LightGBM 3.3.0, and as of that release was only supported in the Python package.

The Sequence object is a Python class which allows LightGBM to iterate over chunks of raw data to build up a Dataset object incrementally.

This object should be used when your raw data are already partitioned into files (e.g. a collection of HDF5 files) or when you are concerned about exceeding the available memory during Dataset construction.
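
A rough sketch of a custom Sequence follows (the class name and chunking scheme are illustrative; a single in-memory array stands in for data that would normally arrive in chunks, e.g. from HDF5 files). A subclass implements __getitem__ and __len__ so LightGBM can pull rows incrementally while building the Dataset:

```python
import numpy as np
import lightgbm as lgb

class InMemorySequence(lgb.Sequence):
    def __init__(self, data, batch_size=1000):
        self.data = data
        self.batch_size = batch_size  # rows LightGBM reads per chunk

    def __getitem__(self, idx):
        # idx may be an integer or a slice
        return self.data[idx]

    def __len__(self):
        return len(self.data)

X = np.random.rand(5000, 10)
y = np.random.rand(5000)

# a Sequence (or a list of Sequences) can be passed as the data argument
seq_dataset = lgb.Dataset(InMemorySequence(X), label=y)
booster = lgb.train({"objective": "regression"}, seq_dataset, num_boost_round=10)
```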

Upvotes: 4
