sierra_papa

Reputation: 394

RandomizedSearchCV sampling distribution

According to RandomizedSearchCV documentation (emphasis mine):

param_distributions: dict or list of dicts

Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.

If my understanding of the above is correct, both algorithms (XGBClassifier and LogisticRegression) in the following example should be sampled with high probability (>99%), given n_iter = 10.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from xgboost.sklearn import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline


param_grid = [
              {'scaler': [StandardScaler()],
               'feature_selection': [RFE(estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'))],
               'feature_selection__n_features_to_select': [3],
               'classification': [XGBClassifier(use_label_encoder=False, eval_metric='logloss')],
               'classification__n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
               'classification__max_depth': [2, 5, 10],
               },
              {'scaler': [StandardScaler()],
               'feature_selection': [RFE(estimator=LogisticRegression())],
               'feature_selection__n_features_to_select': [3],
               'classification': [LogisticRegression()],
               'classification__C': [0.1],
               },
              ]


pipe = Pipeline(steps=[('scaler', StandardScaler()), ('feature_selection', RFE(estimator=LogisticRegression())),
                       ('classification', LogisticRegression())])

classifier = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10,
                                scoring='neg_brier_score', n_jobs=-1, verbose=10)

data = load_breast_cancer()
X = data.data
y = data.target.ravel()
classifier.fit(X, y)

What happens, though, is that every time I run it, XGBClassifier gets chosen 10/10 times. I would expect at least one candidate to come from LogisticRegression, since the probability of each dict being sampled should be 50-50. If the search space between the two algorithms is more balanced (e.g. 'classification__n_estimators': [100]), then the sampling works as expected. Can someone clarify what's going on here?
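For reference, the ">99%" figure above follows from treating each of the 10 candidates as an independent 50-50 choice between the two dicts:

```python
# Sanity check of the ">99%" claim: if each of the 10 candidates
# independently picks one of the two dicts with probability 0.5,
# the chance that the LogisticRegression dict is never picked is 0.5**10.
p_logreg_at_least_once = 1 - 0.5 ** 10
print(p_logreg_at_least_once)  # 0.9990234375
```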

Upvotes: 1

Views: 435

Answers (1)

Ben Reiniger

Reputation: 12582

Yes, this is incorrect behavior, and there's an issue filed for it: when all of the entries are lists (none are scipy distributions), the current code samples points from the flattened ParameterGrid, which means it disproportionately chooses points from the larger dictionary-grid in your list.
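The bias is easy to reproduce with ParameterSampler directly. This sketch mirrors the shape of the question's grids but uses placeholder labels ('xgb', 'logreg') and bare parameter names instead of the actual pipeline:

```python
from sklearn.model_selection import ParameterSampler

# Two dicts mirroring the question: one with a 10x3 grid of points,
# one with a single point (placeholder values, not real estimators).
param_grid = [
    {"clf": ["xgb"],
     "n_estimators": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     "max_depth": [2, 5, 10]},
    {"clf": ["logreg"], "C": [0.1]},
]

# With only lists present, the sampler draws from the flattened
# ParameterGrid (30 + 1 = 31 points) without replacement, so the
# second dict accounts for just 1 point in 31 rather than half
# of the candidates.
samples = list(ParameterSampler(param_grid, n_iter=31, random_state=0))
counts = {label: sum(p["clf"] == label for p in samples)
          for label in ("xgb", "logreg")}
print(counts)  # {'xgb': 30, 'logreg': 1}
```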

Until a fix gets merged, you might be able to work around this by using a scipy distribution for something you don't care about, say for verbose?
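A sketch of that workaround: a frozen scipy distribution that can only yield one value (scipy.stats.randint(0, 1) always draws 0, since the upper bound is exclusive) is enough to stop the sampler from seeing an all-lists specification, flipping it into "pick a dict uniformly, then sample within it" mode. As above, the names here are placeholders, not the asker's pipeline:

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same two dicts as in the question (placeholder values), plus one
# rvs-capable distribution that always yields 0, added purely so the
# sampler no longer treats the specification as all-lists.
param_grid = [
    {"clf": ["xgb"],
     "n_estimators": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     "max_depth": [2, 5, 10],
     "verbose": randint(0, 1)},  # [0, 1) support: only ever draws 0
    {"clf": ["logreg"], "C": [0.1]},
]

# Now each draw first picks a dict uniformly (50-50), then samples
# inside it, so both dicts appear in roughly half of the candidates.
samples = list(ParameterSampler(param_grid, n_iter=1000, random_state=0))
n_logreg = sum(p["clf"] == "logreg" for p in samples)
print(n_logreg)  # roughly 500 of 1000
```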

Upvotes: 1
