user466534

Uniformly distributed random variables in the RandomizedSearchCV algorithm

I would like to clarify one thing. I know that the following command generates a uniformly distributed random variable in (loc, loc + scale):

from scipy.stats import uniform
C = uniform.rvs(loc=0, scale=4)
print(C)

Now let us suppose that I want to use this distribution for the C parameter of a logistic regression with the RandomizedSearchCV algorithm, as shown below:

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

# logreg (a LogisticRegression) and iris (load_iris()) as defined earlier
parameters = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])
clf = RandomizedSearchCV(logreg, parameters, random_state=0)
search = clf.fit(iris.data, iris.target)
print(search.best_params_)

But I did not understand one thing: RandomizedSearchCV is like a grid search, except that it tries a random selection of combinations for a given number of trials (n_iter). Here, however, C is an object, not an array or anything like that, and I can't even print its value. So how should I understand this code? How does it generate random numbers without any call to rvs?

Upvotes: 1

Views: 1239

Answers (2)

You want to give RandomizedSearchCV more than one value of C to explore. refit=True refits the estimator on the whole training set with the best parameters found, so clf can be used directly for prediction; return_train_score=True additionally records the training scores in cv_results_.

from scipy.stats import uniform
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)  # e.g. the iris data from the question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
logreg = LogisticRegression(C=5, max_iter=10000)
parameter_grid = {'C': [uniform.rvs(loc=0, scale=4), 3, 4, 10, 100],
                  'tol': [0.1, 0.2, 0.3, 0.5, 1, 5, 10, 100]}

clf = RandomizedSearchCV(logreg, parameter_grid,
                         n_iter=10,
                         scoring='accuracy',
                         cv=5,
                         refit=True,
                         return_train_score=True,
                         random_state=0)

search = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

print("Model accuracy {}%".format(accuracy_score(y_test, predictions) * 100))

cv_results_df = pd.DataFrame(clf.cv_results_)

# show the parameter combinations that were tried
column = cv_results_df.loc[:, ['params']]
print(column)

# extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1]
print(best_row)

# print(clf.cv_results_)
# print(clf.best_index_)  # can be used with iloc to slice the best row
print(clf.best_params_)
print(clf.best_score_)

Upvotes: 0

desertnaut

Reputation: 60321

According to the documentation for the param_distributions argument (here parameters):

Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.
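
That is exactly why C=uniform(loc=0, scale=4) works here: it is a frozen scipy distribution, i.e. an object exposing an rvs method, which RandomizedSearchCV calls internally whenever it needs a new candidate value. A quick check:

from scipy.stats import uniform

dist = uniform(loc=0, scale=4)  # a frozen distribution object, not a number
print(dist.rvs(size=5, random_state=0))  # five samples, all in [0, 4]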

So, what is happening at each iteration is:

  • Sample a value for C according to a uniform distribution in [0, 4]
  • Sample a value for penalty, uniformly between l1 and l2 (i.e. with 50% probability for each)
  • Use these sampled values for running a CV and store the results (the sampling step is sketched in code right after this list)
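
Under the hood, this sampling step is performed by scikit-learn's ParameterSampler, which calls rvs on every entry that provides one and picks uniformly at random from plain lists. A minimal sketch that reproduces just the sampling, outside of any search:

from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])

# the 10 candidate combinations that RandomizedSearchCV would try
for params in ParameterSampler(distributions, n_iter=10, random_state=0):
    print(params)  # e.g. {'C': 2.195..., 'penalty': 'l1'}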

Using the example from the documentation (practically identical to the parameters in your question):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                               random_state=0)
distributions = dict(C=uniform(loc=0, scale=4),
                      penalty=['l2', 'l1'])

clf = RandomizedSearchCV(logistic, distributions, random_state=0)
search = clf.fit(iris.data, iris.target)

we get

search.best_params_
# {'C': 2.195254015709299, 'penalty': 'l1'}

We can go a step further, and see all the (10) combinations used, along with their performance:

import pandas as pd
df = pd.DataFrame(search.cv_results_)
print(df[['params','mean_test_score']])
# result:
                                        params  mean_test_score
0    {'C': 2.195254015709299, 'penalty': 'l1'}         0.980000
1   {'C': 3.3770629943240693, 'penalty': 'l1'}         0.980000
2   {'C': 2.1795327319875875, 'penalty': 'l1'}         0.980000
3   {'C': 2.4942547871438894, 'penalty': 'l2'}         0.980000
4     {'C': 1.75034884505077, 'penalty': 'l2'}         0.980000
5  {'C': 0.22685190926977272, 'penalty': 'l2'}         0.966667
6   {'C': 1.5337660753031108, 'penalty': 'l2'}         0.980000
7   {'C': 3.2486749151019727, 'penalty': 'l2'}         0.980000
8   {'C': 2.2721782443757292, 'penalty': 'l1'}         0.980000
9     {'C': 3.34431505414951, 'penalty': 'l2'}         0.980000

from where it is indeed apparent that all values of C tried were in [0, 4], as requested. Also, since more than one combination achieved the best score of 0.98, scikit-learn reports the first one in the order they appear in cv_results_.
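
We can verify this programmatically: best_index_ points at that first rank-1 row, and the sampled C values can be checked against the requested range (a quick sanity check, reusing the df built above):

# search.best_index_ is the row of cv_results_ behind best_params_
print(search.best_index_)                    # 0 here
print(df.loc[search.best_index_, 'params'])  # {'C': 2.195..., 'penalty': 'l1'}

# every sampled C indeed falls inside [0, 4]
print(df['param_C'].astype(float).between(0, 4).all())  # True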

Looking closely, we see that only 4 of the trials were run with the l1 penalty (and not 5, i.e. 50% of the 10, as we might expect), but this is something to be expected with small random samples (here only 10).
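
To see that this is just small-sample noise, we can mimic how a list parameter is sampled (uniformly at random, as the documentation quoted above says) and watch the l1 fraction approach 50% as the number of draws grows; a rough illustration:

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 10_000):
    draws = rng.choice(['l1', 'l2'], size=n)
    print(n, (draws == 'l1').mean())  # fraction of l1 draws; tends towards 0.5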

Upvotes: 4
