Dmitriy Grankin

Reputation: 618

lightgbm memory issue on wide dataset (400 columns)

I am new to LightGBM. I have big data (billions of rows, constantly updated). The dataset prepared for training is also wide, with around 400 columns.

I have 2 questions:

First, my kernel keeps dying after a few thousand boosting iterations, even on a subset as small as 10,000 rows. Memory use keeps rising during training until it fails. I have 126 GB of memory.

I have tried training with different parameters; the commented values are alternatives I have tried as well:

parameters = {
  'histogram_pool_size': 5000,
  'objective': 'regression',
  'metric': 'l2',
  'boosting': 'dart',  # 'gbdt'
  'num_leaves': 10,  # 100
  'learning_rate': 0.01,
  'verbose': 0,
  'max_bin': 66,  # 6, 60
  'force_col_wise': True,  # default
  'max_depth': 10,  # default
  'min_data_in_leaf': 30,  # default
  'min_child_samples': 20,  # default
  'feature_fraction': 0.5,  # default
  'bagging_fraction': 0.8,  # default
  'bagging_freq': 40,  # default
  'bagging_seed': 11,  # default
  'lambda_l1': 2,  # default
  'lambda_l2': 0.1,  # default
}

Limiting the number of columns seems to help, but I know that some columns with a low global feature importance can still be significant in some local scope.

Second, what is the right way to train LightGBM on big data incrementally and to update the model with new data? I previously worked mainly with neural nets, which are trained incrementally by nature. I know that trees do not work this way: although it is technically possible to update the model, the result will not be the same as a model trained on all of the data at once. How do I deal with that?
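For reference, the continuation mechanism I have found is passing the previous booster to lightgbm.train via init_model; a minimal sketch (new_chunk and y_new are hypothetical placeholders for a fresh batch of data):

# continued-training sketch: init_model resumes from an existing booster,
# appending new trees rather than re-fitting the old ones.
# new_chunk / y_new are hypothetical placeholders for newly arrived data.
new_ds = lightgbm.Dataset(new_chunk, label=y_new)
model = lightgbm.train(parameters,
                       new_ds,
                       num_boost_round=100,
                       init_model=model,            # continue from the previous booster
                       keep_training_booster=True)  # keep the result updatable

This only appends trees fitted to the new batch, which is exactly why it is not equivalent to training on everything at once.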

full code:

# X is a dataframe, y is the target
import lightgbm
from sklearn.model_selection import train_test_split

# find the categorical columns and encode them as integer codes
cat_names = X.select_dtypes(['bool', 'category', 'object']).columns.tolist()
for c in cat_names:
    X[c] = X[c].astype('category')
cat_cols = [X.columns.get_loc(c) for c in cat_names]  # positional indices in X
X[cat_names] = X[cat_names].apply(lambda x: x.cat.codes)

x = X.values
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)

train_ds = lightgbm.Dataset(x_train, label=y_train)
valid_ds = lightgbm.Dataset(x_valid, label=y_valid)

model = lightgbm.train(parameters,
                       train_ds,
                       valid_sets=valid_ds,
                       categorical_feature=cat_cols,
                       num_boost_round=2000,
                       early_stopping_rounds=50)

Upvotes: 1

Views: 1632

Answers (1)

Dmitriy Grankin

Reputation: 618

Changing the data types to smaller ones fixed the memory problem! If your dataset is a pandas DataFrame, do something like this:

# downcast 64-bit numeric columns to their 32-bit counterparts
ds[ds.select_dtypes('float64').columns] = ds.select_dtypes('float64').astype('float32')
ds[ds.select_dtypes('int64').columns] = ds.select_dtypes('int64').astype('int32')

Caution: your data may fall outside the range of the selected dtype, and pandas will silently corrupt your values in that case. For example, the int8 dtype only covers -128 to 127, so pick dtypes that can actually hold your data.

You can check a dtype's range with:

import numpy as np

np.iinfo('int32').min, np.iinfo('int32').max      # integer dtypes
np.finfo('float32').min, np.finfo('float32').max  # float dtypes
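Alternatively, pandas can pick a safe smaller dtype for you, based on the actual values, via pd.to_numeric and its downcast argument; a minimal sketch, assuming ds is the same DataFrame as above:

import pandas as pd

# let pandas choose the smallest dtype that still holds every value in the column
for c in ds.select_dtypes('int64').columns:
    ds[c] = pd.to_numeric(ds[c], downcast='integer')
for c in ds.select_dtypes('float64').columns:
    ds[c] = pd.to_numeric(ds[c], downcast='float')  # float32 at smallest

This avoids the overflow risk, because the target dtype is chosen per column from the data itself.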

Upvotes: 3
