I would like to clarify one thing. I know that the following command generates a uniformly distributed random value in the interval [loc, loc + scale]:
from scipy.stats import uniform
C =uniform.rvs(loc=0,scale=4)
print(C)
Now suppose that I want to use this distribution in logistic regression with RandomizedSearchCV, as shown below:
parameters =dict(C =uniform(loc=0,scale=4),penalty=['l2', 'l1'])
from sklearn.model_selection import RandomizedSearchCV
clf = RandomizedSearchCV(logreg, parameters, random_state=0)
search = clf.fit(iris.data, iris.target)
print(search.best_params_)
But there is one thing I do not understand: RandomizedSearchCV is like a grid search, except that it tries a fixed number (n_iter) of randomly chosen combinations. Here, however, C is an object, not an array or anything similar, and I cannot even print its value. So how should I read this code? How does it generate random numbers without an explicit call to rvs?
Upvotes: 1
Views: 1239
Reputation: 4233
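You need to give RandomizedSearchCV more than one candidate value of C to choose from. With refit=True, the best parameter combination is refit on the whole training set, so clf can be used directly for prediction; return_train_score=True additionally records the training scores in cv_results_.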
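import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
# Assumption: X, y are the iris data used in the question
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
logreg = LogisticRegression(C=5, max_iter=10000)
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
parameter_grid = {'C': [uniform.rvs(loc=0, scale=4), 3, 4, 10, 100], 'tol': [0.1, 0.2, 0.3, 0.5, 1, 5, 10, 100]}
clf = RandomizedSearchCV(logreg, parameter_grid,
                         n_iter=10,
                         scoring='accuracy',
                         cv=5,
                         refit=True,
                         return_train_score=True,
                         random_state=0)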
search = clf.fit(X_train,y_train)
predictions=clf.predict(X_test)
print("Model accuracy {}%".format(accuracy_score(y_test,predictions)*100))
cv_results_df = pd.DataFrame(clf.cv_results_)
column = cv_results_df.loc[:, ['params']]
print(column)
# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1 ]
print(best_row)
#print(clf.cv_results_)
#print(clf.best_index_) you can use with iloc to slice the best row
print(clf.best_params_)
print(clf.best_score_)
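Note that uniform.rvs(loc=0, scale=4) inside the list above is evaluated once, when the dictionary is built, so it contributes a single fixed number to the candidates for C. A small sketch of the difference between that and passing the distribution object itself (the variable names are illustrative only):

```python
from scipy.stats import uniform

# Calling rvs directly returns one number, drawn once at this point.
fixed_value = uniform.rvs(loc=0, scale=4, random_state=0)

# Without rvs we get a "frozen" distribution object that can be
# sampled again and again, which is what a random search needs.
frozen = uniform(loc=0, scale=4)
draws = frozen.rvs(size=5, random_state=0)

print(fixed_value)
print(draws)
```

So a list such as [fixed_value, 3, 4, 10, 100] offers five fixed candidates, whereas the frozen object offers a fresh draw from [0, 4] at every iteration.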
Upvotes: 0
Reputation: 60321
According to the documentation for the param_distributions argument (here parameters):
Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.
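To make the quoted requirement concrete, here is a minimal sketch of how such a dictionary can be sampled. It uses ParameterSampler from sklearn.model_selection, which draws parameter settings from the same kind of dictionary; the variable names are my own, and this is meant as an illustration rather than a line-by-line copy of what RandomizedSearchCV does internally.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# uniform(loc=0, scale=4) is a "frozen" distribution object, not a number.
# It exposes an rvs method, which the sampler calls to draw a value for C.
distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])

# Draw 10 parameter settings (10 is also the default n_iter
# of RandomizedSearchCV).
samples = list(ParameterSampler(distributions, n_iter=10, random_state=0))

for params in samples:
    print(params)
```

Each printed dictionary contains one concrete value for C and one for penalty, so there is never a need to call rvs yourself.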
So, what is happening at each iteration is:
- a value for C is sampled according to a uniform distribution in [0, 4]
- a value for penalty is sampled uniformly between l1 and l2 (i.e. with 50% probability for each)
Using the example from the documentation (practically identical with the parameters in your question):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)
distributions = dict(C=uniform(loc=0, scale=4),
                     penalty=['l2', 'l1'])
clf = RandomizedSearchCV(logistic, distributions, random_state=0)
search = clf.fit(iris.data, iris.target)
we get
search.best_params_
# {'C': 2.195254015709299, 'penalty': 'l1'}
We can go a step further, and see all the (10) combinations used, along with their performance:
import pandas as pd
df = pd.DataFrame(search.cv_results_)
print(df[['params','mean_test_score']])
# result:
params mean_test_score
0 {'C': 2.195254015709299, 'penalty': 'l1'} 0.980000
1 {'C': 3.3770629943240693, 'penalty': 'l1'} 0.980000
2 {'C': 2.1795327319875875, 'penalty': 'l1'} 0.980000
3 {'C': 2.4942547871438894, 'penalty': 'l2'} 0.980000
4 {'C': 1.75034884505077, 'penalty': 'l2'} 0.980000
5 {'C': 0.22685190926977272, 'penalty': 'l2'} 0.966667
6 {'C': 1.5337660753031108, 'penalty': 'l2'} 0.980000
7 {'C': 3.2486749151019727, 'penalty': 'l2'} 0.980000
8 {'C': 2.2721782443757292, 'penalty': 'l1'} 0.980000
9 {'C': 3.34431505414951, 'penalty': 'l2'} 0.980000
from which it is indeed apparent that all values of C tried were in [0, 4], as requested. Also, since more than one combination achieved the best score of 0.98, scikit-learn uses the first one as returned in cv_results_.
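The "first one wins" behaviour can be mimicked with the scores from the table above. This is only a sketch of the tie-breaking idea using np.argmax, not scikit-learn's own code:

```python
import numpy as np

# mean_test_score column copied from the table above
mean_test_score = np.array([0.98, 0.98, 0.98, 0.98, 0.98,
                            0.966667, 0.98, 0.98, 0.98, 0.98])

# np.argmax returns the index of the first maximum when there are ties
first_best = int(np.argmax(mean_test_score))
print(first_best)  # 0, i.e. the row with C = 2.195254015709299
```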
Looking closely, we see that only 4 trials were run with the l1 penalty (and not 50% of the 10, i.e. 5, as we might expect), but this is to be expected with small random samples (here only 10).
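To put a number on this, here is a quick calculation under the assumption that each of the 10 penalty draws is an independent 50/50 choice between l1 and l2 (the helper name is my own):

```python
from math import comb

n_trials = 10  # number of sampled combinations (n_iter)

def prob_l1_count(k, n=n_trials):
    """Probability of exactly k 'l1' draws out of n fair 50/50 draws."""
    return comb(n, k) / 2 ** n

print(f"P(exactly 5 l1) = {prob_l1_count(5):.3f}")  # 0.246
print(f"P(exactly 4 l1) = {prob_l1_count(4):.3f}")  # 0.205
```

Under that assumption, an exact 5/5 split happens only about a quarter of the time, and a 4/6 split like the one above is almost as likely.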
Upvotes: 4