Reputation: 37
I am trying to fit a single decision tree using the Python module lightgbm. However, I find the output a little strange. I have 15 explanatory variables, and the numerical response variable has the following summary statistics:
count 653.000000
mean 31.503813
std 11.838267
min 13.750000
25% 22.580000
50% 28.420000
75% 38.250000
max 76.750000
Name: X2, dtype: float64
I do the following to fit the tree. First, I construct the Dataset object:
df_train = lightgbm.Dataset(
    df,                                    # the data
    label=df[response],                    # the response series
    feature_name=features,                 # a list with names of all explanatory variables
    categorical_feature=categorical_vars,  # a list with names of the categorical ones
)
Next, I define the parameters and fit the model:
param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq': 0,      # disable bagging
    'feature_fraction': 1,  # don't randomly select features; consider all
    'num_trees': 1,
    # tuning parameters:
    'max_leaves': 20,
    'max_depth': -1,
    'min_data_in_leaf': 20,
}
model = lightgbm.train(param, df_train)
From the model I extract the leaves of the tree as:
tree = model.trees_to_dataframe()[[
    'right_child',
    'node_depth',
    'value',
    'count',
]]
leaves = tree[tree.right_child.isnull()]
print(leaves)
right_child node_depth value count
5 None 6 29.957982 20
6 None 6 30.138253 28
8 None 6 30.269373 34
9 None 6 30.404353 38
12 None 6 30.528705 33
13 None 6 30.651690 62
14 None 5 30.842856 59
17 None 5 31.080432 51
19 None 6 31.232860 21
20 None 6 31.358547 26
22 None 5 31.567571 43
23 None 5 31.795345 46
28 None 6 32.034321 27
29 None 6 32.247890 24
31 None 6 32.420886 22
32 None 6 32.594289 21
34 None 5 32.920932 20
35 None 5 33.210205 22
37 None 4 33.809376 36
38 None 4 34.887632 20
Now, if you look at the values, they range from (approximately) 30 to 35. This is far from capturing the distribution of the response variable (shown above, with min = 13.75 and max = 76.75).
Can anyone explain to me what is going on here?
Follow-up based on the accepted answer: I tried adding 'learning_rate': 1 and 'min_data_in_bin': 1 to the parameter dict, which resulted in the following tree:
right_child node_depth value count
5 None 6 16.045500 20
6 None 6 17.824074 27
8 None 6 19.157500 36
9 None 6 20.529730 37
12 None 6 21.805834 36
13 None 6 23.048387 62
14 None 5 24.975263 57
17 None 5 27.335385 52
19 None 6 29.006800 25
20 None 6 30.234286 21
22 None 5 32.221591 44
23 None 5 34.472272 44
28 None 6 36.808889 27
29 None 6 38.944583 24
31 None 6 40.674546 22
32 None 6 42.408572 21
34 None 5 45.675000 20
35 None 5 48.567728 22
37 None 4 54.559445 36
38 None 4 65.341999 20
This is much more desirable. It means that we can now use lightgbm to mimic the behavior of a single decision tree with categorical features. As opposed to sklearn, lightgbm honors "true" categorical variables, whereas in sklearn one needs to one-hot encode all categorical variables, which can turn out really badly; see this Kaggle post.
Upvotes: 0
Views: 499
Reputation: 1312
As you may know, LightGBM does a couple of tricks to speed things up. One of them is feature binning, where the values of each feature are grouped into bins to reduce the possible number of splits. The minimum number of samples per bin is controlled by min_data_in_bin, which defaults to 3, so for example if you have 100 samples you'd have at most about 34 bins.
Another important thing here when using a single tree is that LightGBM does boosting by default, which means that it will start from an initial score and try to gradually improve on it. That gradual change is controlled by the learning_rate, which by default is 0.1, so the predictions from each tree are multiplied by this number and added to the current score.
The last thing to consider is that the tree size is controlled by num_leaves, which is 31 by default. If you want to fully grow the tree, you have to set this number to your number of samples.
So if you want to replicate a full-grown decision tree in LightGBM you have to adjust these parameters. Here's an example:
import lightgbm as lgb
import numpy as np
import pandas as pd
X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=1)
print(pd.concat([
    bst.trees_to_dataframe().loc[lambda x: x['left_child'].isnull(), 'value'].describe().rename('leaves'),
    pd.Series(y).describe().rename('y'),
], axis=1))
|       | leaves   | y        |
|-------|----------|----------|
| count | 100      | 100      |
| mean  | 2.33502  | 2.33502  |
| std   | 0.882451 | 0.882451 |
| min   | 1        | 1        |
| 25%   | 1.56252  | 1.56252  |
| 50%   | 2.25003  | 2.25003  |
| 75%   | 3.06252  | 3.06252  |
| max   | 4        | 4        |
Having said that, if you're looking for a decision tree it's easier to use scikit-learn's:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor().fit(X, y)
np.allclose(bst.predict(X), tree.predict(X))
# True
Upvotes: 2