user466534

Uniformly distributed random variables in the RandomizedSearchCV algorithm

I would like to clarify one thing. I know that the following command generates a uniformly distributed random variable in (loc, loc + scale):

from scipy.stats import uniform
C = uniform.rvs(loc=0, scale=4)
print(C)

Now let us suppose that I want to use this distribution for the C parameter of a logistic regression with the RandomizedSearchCV algorithm, as shown below:

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

# logreg (a LogisticRegression) and iris (load_iris()) as defined earlier
parameters = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])
clf = RandomizedSearchCV(logreg, parameters, random_state=0)
search = clf.fit(iris.data, iris.target)
print(search.best_params_)

But I did not understand one thing: RandomizedSearchCV is like a grid search, except that it tries a random selection of combinations for a given number of trials (n_iter). Here, however, C is an object, not an array or anything like that, and I can't even print its value. So how should I understand this code? How does it generate random numbers without any call to rvs?

Upvotes: 1

Views: 1239

Answers (2)

You want to give RandomizedSearchCV more than one value of C to explore. refit=True refits the estimator on the whole training set with the best parameters found, so clf can be used directly for prediction; return_train_score=True additionally records the training scores in cv_results_.

from scipy.stats import uniform
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)  # e.g. the iris data from the question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
logreg = LogisticRegression(C=5, max_iter=10000)
parameter_grid = {'C': [uniform.rvs(loc=0, scale=4), 3, 4, 10, 100],
                  'tol': [0.1, 0.2, 0.3, 0.5, 1, 5, 10, 100]}

clf = RandomizedSearchCV(logreg, parameter_grid,
                         n_iter=10,
                         scoring='accuracy',
                         cv=5,
                         refit=True,
                         return_train_score=True,
                         random_state=0)

search = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

print("Model accuracy {}%".format(accuracy_score(y_test, predictions) * 100))

cv_results_df = pd.DataFrame(clf.cv_results_)

# show the parameter combinations that were tried
column = cv_results_df.loc[:, ['params']]
print(column)

# extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1]
print(best_row)

# print(clf.cv_results_)
# print(clf.best_index_)  # can be used with iloc to slice the best row
print(clf.best_params_)
print(clf.best_score_)

Upvotes: 0

desertnaut

Reputation: 60321

According to the documentation for the param_distributions argument (here parameters):

Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.
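
That is exactly why C=uniform(loc=0, scale=4) works here: it is a frozen scipy distribution, i.e. an object exposing an rvs method, which RandomizedSearchCV calls internally whenever it needs a new candidate value. A quick check:

from scipy.stats import uniform

dist = uniform(loc=0, scale=4)  # a frozen distribution object, not a number
print(dist.rvs(size=5, random_state=0))  # five samples, all in [0, 4]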

So, what is happening at each iteration is:

  • Sample a value for C according to a uniform distribution in [0, 4]
  • Sample a value for penalty, uniformly between l1 and l2 (i.e. with 50% probability for each)
  • Use these sampled values for running a CV and store the results (the sampling step is sketched in code right after this list)
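
Under the hood, this sampling step is performed by scikit-learn's ParameterSampler, which calls rvs on every entry that provides one and picks uniformly at random from plain lists. A minimal sketch that reproduces just the sampling, outside of any search:

from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])

# the 10 candidate combinations that RandomizedSearchCV would try
for params in ParameterSampler(distributions, n_iter=10, random_state=0):
    print(params)  # e.g. {'C': 2.195..., 'penalty': 'l1'}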

Using the example from the documentation (practically identical to the parameters in your question):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                               random_state=0)
distributions = dict(C=uniform(loc=0, scale=4),
                      penalty=['l2', 'l1'])

clf = RandomizedSearchCV(logistic, distributions, random_state=0)
search = clf.fit(iris.data, iris.target)

we get

search.best_params_
# {'C': 2.195254015709299, 'penalty': 'l1'}

We can go a step further, and see all the (10) combinations used, along with their performance:

import pandas as pd
df = pd.DataFrame(search.cv_results_)
print(df[['params','mean_test_score']])
# result:
                                        params  mean_test_score
0    {'C': 2.195254015709299, 'penalty': 'l1'}         0.980000
1   {'C': 3.3770629943240693, 'penalty': 'l1'}         0.980000
2   {'C': 2.1795327319875875, 'penalty': 'l1'}         0.980000
3   {'C': 2.4942547871438894, 'penalty': 'l2'}         0.980000
4     {'C': 1.75034884505077, 'penalty': 'l2'}         0.980000
5  {'C': 0.22685190926977272, 'penalty': 'l2'}         0.966667
6   {'C': 1.5337660753031108, 'penalty': 'l2'}         0.980000
7   {'C': 3.2486749151019727, 'penalty': 'l2'}         0.980000
8   {'C': 2.2721782443757292, 'penalty': 'l1'}         0.980000
9     {'C': 3.34431505414951, 'penalty': 'l2'}         0.980000

from where it is indeed apparent that all values of C tried were in [0, 4], as requested. Also, since more than one combination achieved the best score of 0.98, scikit-learn reports the first one in the order they appear in cv_results_.
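
We can verify this programmatically: best_index_ points at that first rank-1 row, and the sampled C values can be checked against the requested range (a quick sanity check, reusing the df built above):

# search.best_index_ is the row of cv_results_ behind best_params_
print(search.best_index_)                    # 0 here
print(df.loc[search.best_index_, 'params'])  # {'C': 2.195..., 'penalty': 'l1'}

# every sampled C indeed falls inside [0, 4]
print(df['param_C'].astype(float).between(0, 4).all())  # True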

Looking closely, we see that only 4 of the trials were run with the l1 penalty (and not 5, i.e. 50% of the 10, as we might expect), but this is something to be expected with small random samples (here only 10).
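
To see that this is just small-sample noise, we can mimic how a list parameter is sampled (uniformly at random, as the documentation quoted above says) and watch the l1 fraction approach 50% as the number of draws grows; a rough illustration:

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 10_000):
    draws = rng.choice(['l1', 'l2'], size=n)
    print(n, (draws == 'l1').mean())  # fraction of l1 draws; tends towards 0.5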

Upvotes: 4
