Reputation: 11
I am building an AdaBoost model with sklearn. Last year I made the same model with the same data, and I was able to access the feature importances. This year, when I build the model with the same data, the feature importances attribute contains NaNs. I have read other posts where people have had the same problem because there were NaNs in their data; however, mine contains none.
I am at a loss as to what is different, but I have isolated the base_estimator DecisionTree's max_depth as the problem: the higher the max_depth, the greater the number of NaNs. However, I have identified that max_depth=10 is best for my work. This is my code:
Can anyone point out where I am going wrong, explain what is happening, or suggest another way to get the feature importances?
I have recreated the same error with a sklearn dataset below.
I have an old version of sklearn with Python 2.7, and with the same data this error doesn't occur.
Thank you
Data that I am working with is available here: https://github.com/scikit-learn/scikit-learn/discussions/20315
import pandas
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

train_data = pandas.read_csv('data_train.csv')
model_variables = ['RH','t2m','tp_r5','swvl1','SM_r20','tp','cvh','vdi','SM_r10','SM_IDW']
X = train_data[model_variables]  # Features
y = train_data.ignition_no       # Target

# Confirm there are no missing target values
np.count_nonzero(np.isnan(y))
0

tree = DecisionTreeClassifier(max_depth=10, random_state=12)
ada_model = AdaBoostClassifier(base_estimator=tree, random_state=12)
model = ada_model.fit(X, y)
model.feature_importances_
/home/mo/morc/.virtualenvs/newroo/lib/python3.6/site-packages/sklearn/tree/_classes.py:605: RuntimeWarning: invalid value encountered in true_divide
return self.tree_.compute_feature_importances()
array([ nan, nan, nan, nan, nan,
nan, nan, 0.02568412, nan, nan])
# Here is the same error recreated with the load_digits dataset from sklearn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
dataset = load_digits()
X = dataset['data']
y = dataset['target']

# Cross-validate AdaBoost at several base-estimator depths
score = []
for depth in [1, 2, 10]:
    reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
    scores_ada = cross_val_score(reg_ada, X, y, cv=6)
    score.append(scores_ada.mean())

score
[0.2615310293571163, 0.6466908212560386, 0.9621609067261242]
# Best depth is 10, so build the AdaBoost classifier with a base_estimator of max_depth=10
reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10))
model = reg_ada.fit(X, y)
model.feature_importances_
/home/mo/morc/.virtualenvs/fox/lib/python3.6/site-packages/sklearn/tree/_classes.py:605: RuntimeWarning: invalid value encountered in true_divide
return self.tree_.compute_feature_importances()
array([0.00000000e+00, 3.97071545e-03, nan, 1.04739889e-02,
1.71911851e-02, 1.13877668e-02, 5.53334918e-03, 3.48635371e-03,
3.81562332e-16, 2.97882448e-04, 5.21107270e-03, 1.90482369e-03,
9.54317398e-03, nan, 4.04579846e-03, 2.85770367e-03,
2.41466161e-03, 2.22172771e-04, nan, nan,
2.64452796e-02, 2.35455672e-02, 5.91982800e-03, 9.63862404e-15,
2.51667106e-05, 8.22347398e-03, 3.53522516e-02, 3.49199633e-02,
nan, nan, 7.85924750e-03, 0.00000000e+00,
0.00000000e+00, 2.43861329e-02, nan, 4.52136284e-03,
2.84309340e-02, 8.70846798e-03, nan, 0.00000000e+00,
0.00000000e+00, 8.51258472e-03, nan, 4.08880381e-02,
6.47568594e-03, 1.75046890e-02, 1.37183583e-02, 3.95955193e-32,
0.00000000e+00, 6.36631892e-05, 2.06906508e-02, nan,
nan, nan, 9.47079562e-03, 3.71242630e-03,
0.00000000e+00, 7.14153611e-06, nan, 5.14482654e-03,
2.23621689e-02, 1.79753787e-02, 3.05869803e-03, 4.80512718e-03])
Upvotes: 1
Views: 1668
Reputation: 1
About the NaNs that appear: I find they only appear when using the SAMME.R algorithm, and they really make a mess of things. SAMME.R is the default algorithm for AdaBoostClassifier. The NaNs do not appear when using SAMME, so I compute the feature importances with SAMME and then fit the most important features with SAMME.R.
Two steps:
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def sci_sort(X_train, y_train, depth, tris):
    # Rank the features with SAMME, which does not produce NaN importances
    regr = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                              algorithm='SAMME',
                              n_estimators=X_train.shape[1] + tris)
    regr.fit(X_train, y_train)
    # Save the ranking (feature indices, least to most important) for later runs
    with open('hugrunES_buy.txt', 'w') as fout:
        fout.write(str(list(np.argsort(regr.feature_importances_))) + '\n')
    return np.argsort(regr.feature_importances_)
def sci(X_train, y_train, X_test, y_test, nam, depth, trees, learning_rate, tak):
    # Fit the final model with SAMME.R on the features selected by sci_sort
    regr = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                              algorithm='SAMME.R',
                              n_estimators=trees,
                              learning_rate=learning_rate)
    regr.fit(X_train, y_train)
    return regr
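A minimal usage sketch of the two steps (my addition, not part of the original answer; the argument values, the choice to keep the 10 most important features, and the assumption that X_train/X_test are NumPy arrays are all illustrative):

# Hypothetical usage: rank features with SAMME, then refit the top ones with SAMME.R
ranking = sci_sort(X_train, y_train, depth=10, tris=50)
top = ranking[-10:]  # np.argsort is ascending, so the most important features are at the end
final_model = sci(X_train[:, top], y_train, X_test[:, top], y_test,
                  nam='run', depth=10, trees=50, learning_rate=1.0, tak=None)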
Upvotes: 0
Reputation: 12738
I've narrowed it down in your digits example. At tree 20, feature 38 is used for five splits, and in the last of those (node 353), the impurity of the right child is -np.inf (!?). So the raw (un-normalized) importance at that split is +inf, the total raw importance of the feature for this tree is +inf, and when normalizing the importances for this tree, every other feature gets something / inf = 0, while this feature gets inf / inf = nan. Then, aggregating that across trees, this feature (and others, presumably because of similar issues in other trees) has importance nan (and the other features' importances are skewed for not getting real contributions from this tree).
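Here is a small NumPy sketch of that normalization failure (my addition; the numbers are illustrative):

import numpy as np

# Raw (un-normalized) importances for one tree: one split contributed an
# infinite impurity decrease to the middle feature
raw = np.array([2.0, np.inf, 3.0])

# Normalizing triggers the same "invalid value encountered" RuntimeWarning:
# finite / inf -> 0, while inf / inf -> nan
print(raw / raw.sum())  # [ 0. nan  0.]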
I cannot see what changed between 0.22 and 0.23 that causes this issue, nor do I really understand how the calculation comes up with -inf for the Gini impurity; perhaps some overflow issue?
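As a possible workaround (my addition, not part of this answer): permutation importance sidesteps the impurity-based calculation entirely, so the inf/nan propagation cannot reach it. A minimal sketch with sklearn.inspection.permutation_importance (available since scikit-learn 0.22), reusing the fitted model and the X, y from the question:

from sklearn.inspection import permutation_importance

# Importance is measured by shuffling each column and recording the score drop,
# so the broken tree_.compute_feature_importances() path is never used
result = permutation_importance(model, X, y, n_repeats=10, random_state=12)
print(result.importances_mean)

This is slower than reading feature_importances_, but the values stay finite.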
Upvotes: 0