Reputation: 1139
The imblearn library is used for imbalanced classification problems. It lets you use scikit-learn estimators while balancing the classes with a variety of methods, from undersampling to oversampling to ensembles.
My question is, however: how can I get the feature importance of the estimator after using BalancedBaggingClassifier or any other sampling method from imblearn?
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
bbc = BalancedBaggingClassifier(random_state=42,
                                base_estimator=DecisionTreeClassifier(criterion='gini',
                                                                      max_features='sqrt',
                                                                      random_state=1),
                                n_estimators=2000)
bbc.fit(X_train,y_train)
Upvotes: 5
Views: 3511
Reputation: 1837
According to the scikit-learn documentation, you can compute an impurity-based feature importance for classifiers that don't expose their own by fitting some sort of forest classifier alongside. Here my classifier doesn't have feature_importances_, so I'm attaching it directly:
from sklearn.ensemble import ExtraTreesClassifier

classifier.fit(x_train, y_train)
...
...
# Fit a surrogate forest with the same settings, then copy its importances over
forest = ExtraTreesClassifier(n_estimators=classifier.n_estimators,
                              random_state=classifier.random_state)
forest.fit(x_train, y_train)
classifier.feature_importances_ = forest.feature_importances_
Upvotes: 0
Reputation: 3775
Not all estimators in sklearn allow you to get feature importances (for example, BaggingClassifier doesn't). If the estimator does, it should be stored as estimator.feature_importances_, since the imblearn package subclasses sklearn classes. I don't know which estimators imblearn has implemented, so I don't know whether any provide feature_importances_, but in general you should check the sklearn documentation for the corresponding object to see if it does.
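A quick way to check whether a fitted estimator exposes importances is hasattr; a minimal sketch (a fitted RandomForestClassifier has the attribute, a fitted BaggingClassifier does not):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Trees and forests compute impurity-based importances; plain bagging does not.
print(hasattr(rf, "feature_importances_"))   # True
print(hasattr(bag, "feature_importances_"))  # False
```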
You can, in this case, look at the feature importances for each of the estimators within the BalancedBaggingClassifier, like this:
for estimator in bbc.estimators_:
    print(estimator.steps[1][1].feature_importances_)
And you can print the mean importance across the estimators (with numpy imported as np) like this:
import numpy as np
print(np.mean([est.steps[1][1].feature_importances_
               for est in bbc.estimators_], axis=0))
Upvotes: 5
Reputation: 1139
There is a shortcut around this; however, it is not very efficient. The BalancedBaggingClassifier applies RandomUnderSampler repeatedly and fits the estimator on each resample. A for-loop with RandomUnderSampler can be one way of bypassing the pipeline and calling the scikit-learn estimator directly. This also allows you to look at feature_importances_:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

rus = RandomUnderSampler(random_state=1)
my_list = []
for i in range(10):  # random under-sampling 10 times
    X_pl, y_pl = rus.fit_resample(X_train, y_train)
    my_list.append((X_pl, y_pl))  # forming tuples from samples

X_pl = []
Y_pl = []
for num in range(len(my_list)):  # creating the DataFrames for input/output
    X_pl.append(pd.DataFrame(my_list[num][0]))
    Y_pl.append(pd.DataFrame(my_list[num][1]))

X_pl_ = pd.concat(X_pl)  # concatenating the DataFrames
Y_pl_ = pd.concat(Y_pl)

RF = RandomForestClassifier(n_estimators=2000, criterion='gini',
                            max_features='sqrt', random_state=1)
RF.fit(X_pl_, Y_pl_.values.ravel())
RF.feature_importances_
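Once a forest is fitted this way, the importances are easier to read when paired with feature labels and sorted; a small sketch on synthetic data (the f0…f19 names are made up here for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_features=20, random_state=10)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Pair each importance with a feature label, then sort descending.
importances = pd.Series(rf.feature_importances_,
                        index=[f"f{i}" for i in range(X.shape[1])])
print(importances.sort_values(ascending=False).head())
```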
Upvotes: 0