Ankita

Reputation: 485

Gaussian process regression hyperparameter optimisation using Python grid search

I have started learning Gaussian process regression with the sklearn library, using my own data points as given below. Although I got a result, it is inaccurate because I did not do hyperparameter optimisation. After a couple of Google searches I wrote the grid search code below, but it is not running as expected. I don't know where I made my mistake; please help, and thanks in advance.

The sample of input and output data is given as follows

X_tr= [10.8204  7.67418 7.83013 8.30996 8.1567  6.94831 14.8673 7.69338 7.67702 12.7542 11.847] 
y_tr= [1965.21  854.386 909.126 1094.06 1012.6  607.299 2294.55 866.316 822.948 2255.32 2124.67]
X_te= [7.62022  13.1943 7.76752 8.36949 7.86459 7.16032 12.7035 8.99822 6.32853 9.22345 11.4751]

X_tr and y_tr are the training data points, X_te the test points; all have been reshaped and have the type 'Array of float64'.

Here is my grid search code:

from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(
        gp(), tuned_parameters, scoring='%s_macro' % score
    )
    clf.fit(X_tr, y_tr)

Here is a sample of my code without hyperparameter optimisation:

import numpy as np
import sklearn.gaussian_process as gp
kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))
model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=True)

X_tr = np.array([X_tr])
X_te = np.array([X_te])
y_tr = np.array([y_tr])

model.fit(X_tr, y_tr)
params = model.kernel_.get_params()
X_te = X_te.reshape(-1,1)
y_pred, std = model.predict(X_te, return_std=True)

Upvotes: 3

Views: 8062

Answers (1)

LeoC

Reputation: 932

There were a few issues in the code snippet you provided; the one below is a working example:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process.kernels import RBF, DotProduct
import numpy as np

X_tr = np.array([10.8204, 7.67418, 7.83013, 8.30996, 8.1567, 6.94831, 14.8673, 7.69338, 7.67702, 12.7542, 11.847])
y_tr = np.array([1965.21, 854.386, 909.126, 1094.06, 1012.6, 607.299, 2294.55, 866.316, 822.948, 2255.32, 2124.67])
X_te = np.array([7.62022, 13.1943, 7.76752, 8.36949, 7.86459, 7.16032, 12.7035, 8.99822, 6.32853, 9.22345, 11.4751])


param_grid = [{
    "alpha":  [1e-2, 1e-3],
    "kernel": [RBF(l) for l in np.logspace(-1, 1, 2)]
}, {
    "alpha":  [1e-2, 1e-3],
    "kernel": [DotProduct(sigma_0) for sigma_0 in np.logspace(-1, 1, 2)]
}]

# scores for regression
scores = ['explained_variance', 'r2']

gp = GaussianProcessRegressor()
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(estimator=gp, param_grid=param_grid, cv=4,
                       scoring=score)
    clf.fit(X_tr.reshape(-1, 1), y_tr)
    print(clf.best_params_)

I would like to break it down now to provide some explanation. The first part is the data. You will need more data (presumably you only gave a sample here), and you will also need to rescale it for the Gaussian process to work efficiently, as sketched below.
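As a minimal sketch, rescaling could look like this (assuming sklearn's StandardScaler; scaling y as well is my choice here and optional, since normalize_y=True on the regressor has a similar effect):

from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()

# fit the scalers on the training data only, then reuse them on the test data
X_tr_s = x_scaler.fit_transform(X_tr.reshape(-1, 1))
y_tr_s = y_scaler.fit_transform(y_tr.reshape(-1, 1)).ravel()
X_te_s = x_scaler.transform(X_te.reshape(-1, 1))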

The second part is the param_grid. The parameter grid can be a dictionary or a list of dictionaries; I used a list of dictionaries since it appears you are interested in comparing the performance of different kernels. The granularity of the parameter grid above is very low. When you add more data, I would recommend increasing the granularity by adding more test values for alpha and increasing the np.logspace steps as well as the bounds, as in the sketch after this paragraph.
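For illustration only (the exact ranges here are a guess and should be tuned to your data), a denser grid might look like:

param_grid = [{
    "alpha":  np.logspace(-3, 0, 4),                    # noise levels 1e-3 ... 1
    "kernel": [RBF(l) for l in np.logspace(-2, 2, 10)]  # length scales 1e-2 ... 1e2
}, {
    "alpha":  np.logspace(-3, 0, 4),
    "kernel": [DotProduct(s) for s in np.logspace(-2, 2, 10)]
}]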

The third part is the scores to test. In the snippet above you had scores for classification algorithms; since you are interested in regression, I used scores for regression. Other built-in regression scorers work too, as listed below.
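For instance (these names are sklearn's built-in scoring strings; note the neg_ prefix on the error metrics, because GridSearchCV always maximises the score):

scores = ['explained_variance', 'r2',
          'neg_mean_squared_error', 'neg_mean_absolute_error']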

The fourth part runs the model and prints the best parameters for each score. I couldn't get any reliable fits because the dataset was really limited. Note the reshape of the X_tr input, as it's one-dimensional. Once the search has finished you can predict with the best model, as sketched after this paragraph.
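As a minimal sketch of that final step (GridSearchCV refits the best model on the full training set by default, so best_estimator_ is ready to use):

best_model = clf.best_estimator_
print(clf.best_params_)

# predict on the test points, with the GP's predictive standard deviation
y_pred, std = best_model.predict(X_te.reshape(-1, 1), return_std=True)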

Upvotes: 6
