mlindsk

Reputation: 37

Python - decision tree in lightgbm with odd values

I am trying to fit a single decision tree using the Python module lightgbm. However, I find the output a little strange. I have 15 explanatory variables, and the numerical response variable has the following summary statistics:

count    653.000000
mean      31.503813
std       11.838267
min       13.750000
25%       22.580000
50%       28.420000
75%       38.250000
max       76.750000
Name: X2, dtype: float64

I do the following to fit the tree: I first construct the Dataset object

df_train = lightgbm.Dataset(
    df, # The data 
    label = df[response], # The response series
    feature_name = features, # A list with names of all explanatory variables
    categorical_feature = categorical_vars # A list with names of the categorical ones
)

Next, I define the parameters and fit the model:

param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq':0,  # Disable bagging
    'feature_fraction':1, # don't randomly select features. consider all.
    'num_trees': 1,
    
    # tuning parameters
    'max_leaves': 20,
    'max_depth': -1,
    'min_data_in_leaf': 20
}

model = lightgbm.train(param, df_train)

From the model I extract the leaves of the tree as:

tree = model.trees_to_dataframe()[[
    'right_child',
    'node_depth',
    'value',
    'count'
]]

leaves = tree[tree.right_child.isnull()]

print(leaves)

   right_child  node_depth      value  count
5         None           6  29.957982     20
6         None           6  30.138253     28
8         None           6  30.269373     34
9         None           6  30.404353     38
12        None           6  30.528705     33
13        None           6  30.651690     62
14        None           5  30.842856     59
17        None           5  31.080432     51
19        None           6  31.232860     21
20        None           6  31.358547     26
22        None           5  31.567571     43
23        None           5  31.795345     46
28        None           6  32.034321     27
29        None           6  32.247890     24
31        None           6  32.420886     22
32        None           6  32.594289     21
34        None           5  32.920932     20
35        None           5  33.210205     22
37        None           4  33.809376     36
38        None           4  34.887632     20

Now, if you look at the values, they range from (approximately) 30 to 35. This is far from capturing the distribution (shown above with min = 13.75 and max = 76.75) of the response variable.

Can anyone explain to me what is going on here?

Follow Up Based On Accepted Answer:

I tried adding 'learning_rate': 1 and 'min_data_in_bin': 1 to the parameter dict, which resulted in the following tree:

   right_child  node_depth      value  count
5         None           6  16.045500     20
6         None           6  17.824074     27
8         None           6  19.157500     36
9         None           6  20.529730     37
12        None           6  21.805834     36
13        None           6  23.048387     62
14        None           5  24.975263     57
17        None           5  27.335385     52
19        None           6  29.006800     25
20        None           6  30.234286     21
22        None           5  32.221591     44
23        None           5  34.472272     44
28        None           6  36.808889     27
29        None           6  38.944583     24
31        None           6  40.674546     22
32        None           6  42.408572     21
34        None           5  45.675000     20
35        None           5  48.567728     22
37        None           4  54.559445     36
38        None           4  65.341999     20

This is much more desirable. It means that we can now use lightgbm to mimic the behavior of a single decision tree with categorical features. Unlike sklearn, lightgbm handles "true" categorical variables natively, whereas in sklearn one needs to one-hot encode all categorical variables, which can turn out really badly; see this kaggle post. A minimal sketch of passing categorical features is shown below.
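For completeness, here is a minimal sketch (the toy data and column names are made up for illustration) of how categorical features can be passed to lightgbm without one-hot encoding: columns with pandas 'category' dtype can be listed in categorical_feature and the tree splits on the raw categories.

import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data -- 'color' is a true categorical feature, no one-hot encoding needed
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({
    'color': pd.Categorical(rng.choice(['red', 'green', 'blue'], size=200)),
    'size': rng.normal(size=200),
})
y_toy = rng.normal(size=200)

ds_toy = lgb.Dataset(df_toy, label=y_toy, categorical_feature=['color'])
params_toy = {
    'objective': 'regression',
    'num_trees': 1,          # single tree
    'learning_rate': 1,      # no shrinkage
    'min_data_in_bin': 1,
    'min_data_in_leaf': 1,
    'verbose': -1,
}
bst_toy = lgb.train(params_toy, ds_toy)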

Upvotes: 0

Views: 499

Answers (1)

josemz

Reputation: 1312

As you may know, LightGBM does a couple of tricks to speed things up. One of them is feature binning, where the values of the features are assigned to bins to reduce the possible number of splits. By default the minimum number of samples per bin (min_data_in_bin) is 3, so for example if you have 100 samples you'd have about 34 bins.
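As a rough sketch of that (the data here are made up and the bin counts are only approximate), bin-related settings such as min_data_in_bin and max_bin are Dataset parameters:

import lightgbm as lgb
import numpy as np

# 100 distinct values; with the default min_data_in_bin = 3 they are grouped
# into roughly 100 / 3 ~ 34 bins, so only ~34 split thresholds are considered.
X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0] ** 2

ds_default = lgb.Dataset(X, y)                               # default binning
ds_fine = lgb.Dataset(X, y, params={'min_data_in_bin': 1})   # allow one sample per bin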

Another important thing here when using a single tree is that LightGBM does boosting by default, which means that it will start from an initial score and try to gradually improve on it. That gradual change is controlled by the learning_rate, which is 0.1 by default, so the predictions from each tree are multiplied by this number and added to the current score.
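To make the effect concrete, here is a back-of-the-envelope illustration (the leaf mean below is made up; this assumes boost_from_average, the default for regression, so the initial score is the mean of the response):

# With learning_rate = 0.1 a leaf ends up roughly at
#   initial_score + 0.1 * (mean of y in that leaf - initial_score),
# which is why the question's leaves all sit close to the overall mean of ~31.5.
y_mean = 31.5            # overall mean of the response (from the describe() above)
leaf_mean = 65.0         # hypothetical mean of y inside one leaf
learning_rate = 0.1

leaf_value = y_mean + learning_rate * (leaf_mean - y_mean)
print(leaf_value)        # 34.85 -- squeezed towards the overall mean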

The last thing to consider is that the tree size is controlled by num_leaves, which is 31 by default. If you want to fully grow the tree you have to set this number to your number of samples.

So if you want to replicate a full-grown decision tree in LightGBM you have to adjust these parameters. Here's an example:

import lightgbm as lgb
import numpy as np
import pandas as pd

X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=1)
print(pd.concat([
    bst.trees_to_dataframe().loc[lambda x: x['left_child'].isnull(), 'value'].describe().rename('leaves'),
    pd.Series(y).describe().rename('y'),
], axis=1))
        leaves        y
count  100         100
mean     2.33502     2.33502
std      0.882451    0.882451
min      1           1
25%      1.56252     1.56252
50%      2.25003     2.25003
75%      3.06252     3.06252
max      4           4

Having said that, if you're looking for a decision tree, it's easier to use scikit-learn's:

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor().fit(X, y)
np.allclose(bst.predict(X), tree.predict(X))
# True

Upvotes: 2
