Reputation: 11
I am building an AdaBoost model with sklearn. Last year I made the same model with the same data, and I was able to access the feature importances. This year, when I build the model with the same data, the feature importances attribute contains NaNs. I have read other posts where people have had the same problem because there were NaNs in their data; however, mine contains none.
I am at a loss as to what is different, but I have isolated the base_estimator DecisionTree's max_depth as the problem: the higher the max_depth, the greater the number of NaNs. However, I have identified that max_depth=10 is best for my work. This is my code:
Can anyone point out where I am going wrong, explain what is happening, or suggest another way to get the feature importances?
I have recreated the same error with a sklearn dataset below.
I have an old version of sklearn with Python 2.7, and with the same data this error doesn't occur.
Thank you
Data that I am working with is available here: https://github.com/scikit-learn/scikit-learn/discussions/20315
import pandas
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

train_data = pandas.read_csv('data_train.csv')
model_variables = ['RH','t2m','tp_r5','swvl1','SM_r20','tp','cvh','vdi','SM_r10','SM_IDW']
X = train_data[model_variables]  # Features
y = train_data.ignition_no       # Target

# Confirm there are no missing target values
np.count_nonzero(np.isnan(y))
0

tree = DecisionTreeClassifier(max_depth=10, random_state=12)
ada_model = AdaBoostClassifier(base_estimator=tree, random_state=12)
model = ada_model.fit(X, y)
model.feature_importances_
/home/mo/morc/.virtualenvs/newroo/lib/python3.6/site-packages/sklearn/tree/_classes.py:605: RuntimeWarning: invalid value encountered in true_divide
return self.tree_.compute_feature_importances()
array([ nan, nan, nan, nan, nan,
nan, nan, 0.02568412, nan, nan])
# Here is the same error recreated with the load_digits dataset from sklearn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
dataset = load_digits()
X = dataset['data']
y = dataset['target']

# Cross-validate AdaBoost at several base-estimator depths
score = []
for depth in [1, 2, 10]:
    reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
    scores_ada = cross_val_score(reg_ada, X, y, cv=6)
    score.append(scores_ada.mean())

score
[0.2615310293571163, 0.6466908212560386, 0.9621609067261242]
# Best depth is 10, so build the AdaBoost classifier with a base_estimator of max_depth=10
reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10))
model = reg_ada.fit(X, y)
model.feature_importances_
/home/mo/morc/.virtualenvs/fox/lib/python3.6/site-packages/sklearn/tree/_classes.py:605: RuntimeWarning: invalid value encountered in true_divide
return self.tree_.compute_feature_importances()
array([0.00000000e+00, 3.97071545e-03, nan, 1.04739889e-02,
1.71911851e-02, 1.13877668e-02, 5.53334918e-03, 3.48635371e-03,
3.81562332e-16, 2.97882448e-04, 5.21107270e-03, 1.90482369e-03,
9.54317398e-03, nan, 4.04579846e-03, 2.85770367e-03,
2.41466161e-03, 2.22172771e-04, nan, nan,
2.64452796e-02, 2.35455672e-02, 5.91982800e-03, 9.63862404e-15,
2.51667106e-05, 8.22347398e-03, 3.53522516e-02, 3.49199633e-02,
nan, nan, 7.85924750e-03, 0.00000000e+00,
0.00000000e+00, 2.43861329e-02, nan, 4.52136284e-03,
2.84309340e-02, 8.70846798e-03, nan, 0.00000000e+00,
0.00000000e+00, 8.51258472e-03, nan, 4.08880381e-02,
6.47568594e-03, 1.75046890e-02, 1.37183583e-02, 3.95955193e-32,
0.00000000e+00, 6.36631892e-05, 2.06906508e-02, nan,
nan, nan, 9.47079562e-03, 3.71242630e-03,
0.00000000e+00, 7.14153611e-06, nan, 5.14482654e-03,
2.23621689e-02, 1.79753787e-02, 3.05869803e-03, 4.80512718e-03])
Upvotes: 1
Views: 1668
Reputation: 1
About the NaNs that appear: I find they only appear when using the SAMME.R algorithm, and they really make a mess of things. SAMME.R is the default algorithm for AdaBoostClassifier. The NaNs do not appear when using SAMME, so I compute the feature importances with SAMME and then fit the most important features with SAMME.R.
Two steps:
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def sci_sort(X_train, y_train, depth, tris):
    # Rank the features with SAMME, which does not produce NaN importances
    regr = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                              algorithm='SAMME',
                              n_estimators=X_train.shape[1] + tris)
    regr.fit(X_train, y_train)
    # Save the ranking (feature indices, least to most important) for later runs
    with open('hugrunES_buy.txt', 'w') as fout:
        fout.write(str(list(np.argsort(regr.feature_importances_))) + '\n')
    return np.argsort(regr.feature_importances_)
def sci(X_train, y_train, X_test, y_test, nam, depth, trees, learning_rate, tak):
    # Fit the final model with SAMME.R on the features selected by sci_sort
    regr = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                              algorithm='SAMME.R',
                              n_estimators=trees,
                              learning_rate=learning_rate)
    regr.fit(X_train, y_train)
    return regr
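A minimal usage sketch of the two steps (my addition, not part of the original answer; the argument values, the choice to keep the 10 most important features, and the assumption that X_train/X_test are NumPy arrays are all illustrative):

# Hypothetical usage: rank features with SAMME, then refit the top ones with SAMME.R
ranking = sci_sort(X_train, y_train, depth=10, tris=50)
top = ranking[-10:]  # np.argsort is ascending, so the most important features are at the end
final_model = sci(X_train[:, top], y_train, X_test[:, top], y_test,
                  nam='run', depth=10, trees=50, learning_rate=1.0, tak=None)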
Upvotes: 0
Reputation: 12738
I've narrowed it down in your digits example. At tree 20, feature 38 is used for five splits, and in the last of those (node 353), the impurity of the right child is -np.inf (!?). So the raw (un-normalized) importance at that split is +inf, the total raw importance of the feature for this tree is +inf, and when normalizing the importances for this tree, every other feature gets something / inf = 0, while this feature gets inf / inf = nan. Then, aggregating that across trees, this feature (and others, presumably because of similar issues in other trees) has importance nan (and the other features' importances are skewed for not getting real contributions from this tree).
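Here is a small NumPy sketch of that normalization failure (my addition; the numbers are illustrative):

import numpy as np

# Raw (un-normalized) importances for one tree: one split contributed an
# infinite impurity decrease to the middle feature
raw = np.array([2.0, np.inf, 3.0])

# Normalizing triggers the same "invalid value encountered" RuntimeWarning:
# finite / inf -> 0, while inf / inf -> nan
print(raw / raw.sum())  # [ 0. nan  0.]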
I cannot see what changed between 0.22 and 0.23 that causes this issue, nor do I really understand how the calculation comes up with -inf for the Gini impurity; perhaps some overflow issue?
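As a possible workaround (my addition, not part of this answer): permutation importance sidesteps the impurity-based calculation entirely, so the inf/nan propagation cannot reach it. A minimal sketch with sklearn.inspection.permutation_importance (available since scikit-learn 0.22), reusing the fitted model and the X, y from the question:

from sklearn.inspection import permutation_importance

# Importance is measured by shuffling each column and recording the score drop,
# so the broken tree_.compute_feature_importances() path is never used
result = permutation_importance(model, X, y, n_repeats=10, random_state=12)
print(result.importances_mean)

This is slower than reading feature_importances_, but the values stay finite.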
Upvotes: 0