Reputation: 218
I am trying to select features from gradient boosting using bootstrapping, performing the bootstrapping via BaggingRegressor
in scikit-learn. I am not sure this is possible or correct, but this is what I've tried:
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

bag = BaggingRegressor(base_estimator=GradientBoostingRegressor(), bootstrap_features=True, random_state=seed)
bag.fit(X, Y)
model = SelectFromModel(bag, prefit=True, threshold='mean')
gbr_boot = model.transform(X)
print('gbr_boot', gbr_boot.shape)
This gives the error:
ValueError: The underlying estimator BaggingRegressor has no `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.
I am not sure how to address this error; I thought gradient boosting provides feature_importances_. I have tried working around it with:
import numpy as np
import pandas as pd

bag = BaggingRegressor(base_estimator=GradientBoostingRegressor(), bootstrap_features=True, random_state=seed)
bag.fit(X, Y)
# average the importances of the fitted base estimators
feature_importances = np.mean([
    tree.feature_importances_ for tree in bag.estimators_
], axis=0)
threshold = np.mean(feature_importances)
# keep importances above the threshold, mark the rest as 'null'
temp = ()
for i in feature_importances:
    if i > threshold:
        temp = temp + (i,)
    else:
        temp = temp + ('null',)
model_features = data.columns
feature = pd.DataFrame(np.array(model_features))
df = pd.DataFrame(temp)
df_total = pd.concat([feature, df], axis=1)
This seems to succeed in giving the selected features that surpass the importance threshold I've set, but I am not sure whether this is the true feature selection from BaggingRegressor that SelectFromModel would also find, or whether (as the scikit-learn error implies to me) it does not exist for this method. For clarity, the reason I am trying BaggingRegressor bootstrapping is that SelectFromModel with gradient boosting alone fluctuates in the number of features it selects, and I read a paper (section 7.1) saying bootstrapping can reduce this variance (as I understood it; I don't have a CS/stats background).
Upvotes: 1
Views: 897
Reputation: 16966
You have to create a wrapper around BaggingRegressor for this problem.
import numpy as np
from sklearn.ensemble import BaggingRegressor

class MyBaggingRegressor(BaggingRegressor):
    @property
    def feature_importances_(self):
        # average the importances of the fitted base estimators
        return np.mean([est.feature_importances_ for est in self.estimators_], axis=0)

    @property
    def coef_(self):
        # average the coefficients of the fitted base estimators
        return np.mean([est.coef_ for est in self.estimators_], axis=0)
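With the wrapper in place, the original call works; here is a minimal sketch, assuming X, Y, and seed are defined as in the question:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

bag = MyBaggingRegressor(base_estimator=GradientBoostingRegressor(), bootstrap_features=True, random_state=seed)
bag.fit(X, Y)

# SelectFromModel can now read feature_importances_ from the wrapper
model = SelectFromModel(bag, prefit=True, threshold='mean')
gbr_boot = model.transform(X)
print('gbr_boot', gbr_boot.shape)

Note that with bootstrap_features=True each base estimator's importances refer to its own resampled columns (see estimators_features_), so the simple average here is only an approximation.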
There is an existing issue about this in scikit-learn here, along with the corresponding PR.
Note: you don't have to use BaggingRegressor if your base_estimator is GradientBoostingRegressor; use the subsample param to achieve the same effect.
subsample: float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
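For instance, a rough sketch (the value subsample=0.8 and the X, Y, seed variables are assumptions, not values from the question): stochastic gradient boosting can feed SelectFromModel directly, since GradientBoostingRegressor itself exposes feature_importances_.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# subsample < 1.0 fits each tree on a random fraction of the rows
# (stochastic gradient boosting), playing a similar role to bagging
gbr = GradientBoostingRegressor(subsample=0.8, random_state=seed)
gbr.fit(X, Y)

model = SelectFromModel(gbr, prefit=True, threshold='mean')
X_selected = model.transform(X)
print('X_selected', X_selected.shape)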
Upvotes: 1