Reputation: 3455
Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation.
The documentation is also confusing me because under the fit() method, it has an option for unsupervised learning (says to use None for unsupervised learning). But if you want to do unsupervised learning, you need to do it without cross validation and there appears to be no option to get rid of cross validation.
Upvotes: 34
Views: 21364
Reputation: 935
You can write your own grid search using ParameterGrid.
For example:
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

# estimator, X_train, y_train, X_val and y_val are assumed to be defined already
param_grid = {'a': [1, 2], 'b': [True, False]}
param_candidates = ParameterGrid(param_grid)
print(f'{len(param_candidates)} candidates')

results = []
for i, params in enumerate(param_candidates):
    # clone so every candidate starts from a fresh, unfitted estimator
    model = clone(estimator).set_params(**params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    results.append([params, score])
    print(f'{i+1}/{len(param_candidates)}: ', params, score)

# best [params, score] pair
print(max(results, key=lambda x: x[1]))
To increase performance I would suggest parallelizing the loop:
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

param_grid = {'a': [1, 2], 'b': [True, False]}
param_candidates = ParameterGrid(param_grid)
print(f'{len(param_candidates)} candidates')

def fit_model(params):
    # clone so parallel workers never share one estimator
    model = clone(estimator).set_params(**params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    return [params, score]

# n_jobs=-1 uses all available cores
results = Parallel(n_jobs=-1)(delayed(fit_model)(params) for params in param_candidates)
print(max(results, key=lambda x: x[1]))
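For the clustering case in the question, the same loop might look roughly like the sketch below. The synthetic data, the candidate values for n_clusters and the choice of silhouette score as the selection metric are all assumptions made for illustration, not part of the original question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Assumed toy data: four well-separated blobs.
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Assumed search space for the number of clusters.
param_grid = {'n_clusters': [2, 3, 4, 5, 6]}

results = []
for params in ParameterGrid(param_grid):
    # Fit on the full data set and score the resulting labels; no splitting involved.
    model = KMeans(n_init=10, random_state=0, **params)
    labels = model.fit_predict(X)
    results.append((params, silhouette_score(X, labels)))

# Keep the candidate with the highest silhouette score.
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

No train/validation split is needed here because the silhouette score is computed from the data and the cluster labels alone.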
Upvotes: 3
Reputation: 61
I recently came up with the following custom cross-validator, based on this answer. I passed it to GridSearchCV and it properly disabled the cross-validation for me:
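import numpy as np

class DisabledCV:
    def __init__(self):
        self.n_splits = 1

    def split(self, X, y=None, groups=None):
        # one "split": train and test are both the full data set
        # (len(X) is used for both because y is None for unsupervised learning)
        indices = np.arange(len(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits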
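As a rough usage sketch, a splitter like this could be passed to GridSearchCV as shown below. The toy data, the parameter grid and the silhouette_scorer helper are assumptions added for illustration; the class is repeated so that the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV


class DisabledCV:
    """Yields a single 'split' whose train and test sets are both the full data."""

    def __init__(self):
        self.n_splits = 1

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


def silhouette_scorer(estimator, X, y=None):
    # Hypothetical scorer: silhouette of the labels the fitted estimator assigns to X.
    return float(silhouette_score(X, estimator.predict(X)))


# Assumed toy data: four well-separated blobs.
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

gs = GridSearchCV(
    estimator=KMeans(n_init=10, random_state=0),
    param_grid={'n_clusters': [2, 3, 4, 5, 6]},
    scoring=silhouette_scorer,
    cv=DisabledCV(),
)
gs.fit(X)  # y is omitted for unsupervised learning
print(gs.best_params_)
```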
I hope it can help.
Upvotes: 6
Reputation: 3455
After much searching, I was able to find this thread. It appears that you can get rid of cross validation in GridSearchCV if you use:
cv=[(slice(None), slice(None))]
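Each entry of cv is a (train, test) pair of indices, and slice(None) selects every row, so each candidate is fit and scored on the full data set. I have tested this against my own hand-coded grid search without cross validation and I get the same results from both methods. I am posting this answer to my own question in case others have the same issue.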
Edit: to answer jjrr's question in the comments, here is an example use case:
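import sklearn.cluster
from sklearn.metrics import silhouette_score as sc
from sklearn.model_selection import GridSearchCV

def cv_silhouette_scorer(estimator, X):
    estimator.fit(X)
    cluster_labels = estimator.labels_
    num_labels = len(set(cluster_labels))
    num_samples = len(X.index)
    # silhouette is undefined for a single cluster or one cluster per sample
    if num_labels == 1 or num_labels == num_samples:
        return -1
    else:
        return sc(X, cluster_labels)

# a single "split" whose train and test sets are both the full data set
cv = [(slice(None), slice(None))]
# param_dict, df and cols_of_interest are assumed to be defined elsewhere
gs = GridSearchCV(estimator=sklearn.cluster.MeanShift(), param_grid=param_dict,
                  scoring=cv_silhouette_scorer, cv=cv, n_jobs=-1)
gs.fit(df[cols_of_interest])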
Upvotes: 50
Reputation: 5173
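I think using cv=ShuffleSplit(test_size=0.20, n_splits=1) is a better solution, as this post suggested. It fits each candidate on 80% of the data and scores it on the held-out 20%, so it is a single train/validation split rather than no validation at all.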
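A minimal sketch of that suggestion applied to the KMeans case from the question could look like this. The toy data, the parameter grid, the fixed random_state and the silhouette_scorer helper are assumptions added for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit


def silhouette_scorer(estimator, X, y=None):
    # Hypothetical scorer: silhouette of the labels predicted for the held-out rows.
    return float(silhouette_score(X, estimator.predict(X)))


# Assumed toy data: four well-separated blobs.
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# One shuffled split: 80% of the rows for fitting, 20% for scoring.
cv = ShuffleSplit(n_splits=1, test_size=0.20, random_state=0)

gs = GridSearchCV(
    estimator=KMeans(n_init=10, random_state=0),
    param_grid={'n_clusters': [2, 3, 4, 5, 6]},
    scoring=silhouette_scorer,
    cv=cv,
)
gs.fit(X)
print(gs.best_params_)
```

With the default refit=True, the best candidate is refit on all of X after the search, so gs.best_estimator_ is trained on the full data set.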
Upvotes: 7
Reputation: 10399
I'm going to answer your question since it still seems to be unanswered. To parallelize the for loop, you can use the multiprocessing module. Note that multiprocessing.dummy provides the same Pool API backed by threads rather than processes.
from multiprocessing.dummy import Pool  # thread-based Pool
import functools

from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans()

# define your custom function for passing into each thread
# (X is assumed to be defined already)
def find_cluster(n_clusters, kmeans, X):
    # clone first: the threads would otherwise all mutate the same estimator
    model = clone(kmeans).set_params(n_clusters=n_clusters)
    labels = model.fit_predict(X)  # fit & predict
    return silhouette_score(X, labels)  # get the score

# Now's the parallel implementation
clusters = [3, 4, 5]
pool = Pool()
results = pool.map(functools.partial(find_cluster, kmeans=kmeans, X=X), clusters)
pool.close()
pool.join()

# print the results
print(results)  # a list of scores in the same order as the clusters list
Upvotes: 9