Reputation: 701
I want to compute the feature importances of a given dataset using ExtraTreesClassifier. My goal is to find high-scoring features for further classification steps. The dataset X has shape (10000, 50), where the 50 columns are the features. It contains data collected from only one user (i.e., from a single class), and Y holds the labels (all zeros).
However, the output returns all feature importances as zeros!
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt

X = pd.DataFrame(X)
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)  # built-in attribute of tree-based classifiers

# Plot the feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
Can anyone tell me why all features have zero importance scores?
Upvotes: 1
Views: 2996
Reputation: 11
If the class labels all have the same value then the feature importances will all be 0.
I am not familiar enough with the algorithms to give a technical explanation as to why the importances are returned as 0 rather than nan or similar, but from a theoretical perspective:
You are using an ExtraTreesClassifier which is an ensemble of decision trees. Each of these decision trees will attempt to differentiate between samples of different classes in the target by minimizing impurity in some way (gini or entropy in the case of sklearn extra trees). When the target only contains samples of a single class, the impurity is already at the minimum, so no splits are required by the decision tree to reduce impurity any further. As such, no features are required to reduce impurity, so each feature will have an importance of 0.
Consider this another way. Each feature has exactly the same relationship with the target as any other feature: the target is 0 no matter what the value of the feature is for a specific sample (the target is completely independent of the feature). As such, each feature provides no new information about the target and so has no value for making a prediction.
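A minimal sketch of the point above, using small synthetic data (the shapes and random seed are arbitrary, not from the question): with a single-class target every importance comes back 0, while a target that actually varies produces importances that sum to 1.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Single-class target: impurity is already minimal, no splits are made,
# so every feature importance is 0.
y_single = np.zeros(100)
model = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y_single)
print(model.feature_importances_)  # [0. 0. 0. 0. 0.]

# Two-class target: splits now reduce impurity, so the importances
# are nonzero and normalized to sum to 1.
y_two = (X[:, 0] > 0).astype(int)
model2 = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y_two)
print(model2.feature_importances_)
```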
Upvotes: 1
Reputation: 119
You can try doing:
print(sorted(zip(model.feature_importances_, X.columns), reverse=True))
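For context, here is that one-liner on some hypothetical toy data (the column names and target are made up for illustration; note that with an all-zero Y, as in the question, it would still print all-zero importances):

```python
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical stand-in for the question's X and Y
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
Y = (X["f0"] + X["f1"] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, Y)

# Pair each importance score with its column name, highest first
pairs = sorted(zip(model.feature_importances_, X.columns), reverse=True)
print(pairs)
```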
Upvotes: 0