Reputation: 81
I am trying to create an SVR (support vector regression) model. I am generating the data from the sinc function with some Gaussian noise.
Now, in order to find the best parameters for the RBF kernel, I am using GridSearchCV with 5-fold cross-validation.
P.S. I am new to Python and machine learning, so the code may not be very optimised or correct in some way.
My code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
def generateData(N, sigmaT):
    # Input data points
    data = np.reshape(np.linspace(-10, 10, N), (N, 1))
    # Noise in target with zero mean and standard deviation sigmaT
    epi = np.random.normal(0, sigmaT, N)
    # Target
    t1 = np.sinc(data).ravel()        # target without noise
    t2 = np.sinc(data).ravel() + epi  # target with noise
    t1 = np.reshape(t1, (N, 1))
    t2 = np.reshape(t2, (N, 1))
    # Plot the generated data
    plt.plot(data, t1, '--r', label='Original Curve')
    plt.scatter(data, t2, c='orange', label='Data')
    plt.title("Generated data")
    return data, t2, t1
# Generate data from the sinc function
N = 100 # Number of data points
sigmaT = 0.1 # Noise in the data
plt.figure(1)
X, y, true = generateData(N, sigmaT)
y = y.ravel()
# Tuning of parameters for regression by cross-validation
K = 5 # Number of cross-validation folds
# Parameters for tuning
parameters = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5, 0.6, 0.9],'C': [1, 10, 100, 1000, 10000]}]
print("Tuning hyper-parameters")
svr = GridSearchCV(SVR(epsilon = 0.01), parameters, cv = K)
svr.fit(X, y)
# Checking the score for all parameters
print("Grid scores on training set:")
means = svr.cv_results_['mean_test_score']
stds = svr.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, svr.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
And the result is:
Best parameters set found on development set: {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
Grid scores on training set:
-0.240 (+/-0.366) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
-0.535 (+/-1.076) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1}
-0.863 (+/-1.379) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 1}
-3.057 (+/-4.954) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 1}
-1.576 (+/-3.185) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 1}
-0.439 (+/-0.048) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 1}
-0.417 (+/-0.110) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 1}
-0.370 (+/-0.248) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 1}
-0.514 (+/-0.724) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10}
-1.308 (+/-3.002) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10}
-4.717 (+/-10.886) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 10}
-14.247 (+/-27.218) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 10}
-15.241 (+/-19.086) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 10}
-0.533 (+/-0.571) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 10}
-0.566 (+/-0.527) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 10}
-1.087 (+/-1.828) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 10}
-0.591 (+/-1.218) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 100}
-2.111 (+/-2.940) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 100}
-19.591 (+/-29.731) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 100}
-96.461 (+/-96.744) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 100}
-14.430 (+/-10.858) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 100}
-14.742 (+/-37.705) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 100}
-7.915 (+/-10.308) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 100}
-1.592 (+/-1.513) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 100}
-1.543 (+/-3.654) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1000}
-4.629 (+/-10.477) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
-65.690 (+/-92.825) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 1000}
-2745.336 (+/-4173.978) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 1000}
-248.269 (+/-312.776) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 1000}
-65.826 (+/-132.946) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 1000}
-28.569 (+/-64.979) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 1000}
-6.955 (+/-8.647) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 1000}
-3.647 (+/-7.858) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10000}
-12.712 (+/-29.380) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10000}
-1094.270 (+/-2262.303) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 10000}
-3698.268 (+/-8085.389) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 10000}
-2079.620 (+/-3651.872) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 10000}
-70.982 (+/-159.707) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 10000}
-89.859 (+/-180.071) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 10000}
-661.291 (+/-1636.522) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 10000}
Now GridSearchCV gives me the best parameters as C=1, gamma=0.0001, but I checked that the parameters should be C=1000, gamma=0.5.
Now my question is: why is GridSearchCV choosing parameters that fit the data so much worse than the ones I found manually?
Edit: I am also adding the code showing how I found the correct parameters. I simply plugged the parameters into the SVR and checked the mean squared error.
# Working parameters
svr = SVR(kernel='rbf', C=1e3, gamma=0.5, epsilon=0.01)
y_rbf = svr.fit(X, y).predict(X)
# Plotting
plt.figure(1)
plt.plot(X, y_rbf, c='navy', label='Predicted')
plt.legend()
# Checking prediction error
print("Mean squared error: %.2f" % mean_squared_error(true, y_rbf))
The plot for the above parameters is at https://i.sstatic.net/8TH27.jpg
The plot for the parameters chosen by GridSearchCV is at https://i.sstatic.net/rv3Sb.jpg
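For comparison, the same error check can be run with the parameters GridSearchCV selected (a minimal sketch reusing X, y, and true from above; C=1 and gamma=0.0001 come from the grid-search output):
# Parameters chosen by GridSearchCV
svr_gs_params = SVR(kernel='rbf', C=1, gamma=0.0001, epsilon=0.01)
y_gs = svr_gs_params.fit(X, y).predict(X)
print("Mean squared error (grid-search parameters): %.2f" % mean_squared_error(true, y_gs))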
Upvotes: 4
Views: 22894
Reputation: 36599
A couple of things play an important part here:
1) The scoring criterion used by GridSearchCV to find the best params. Since you have not provided any value for the scoring parameter of GridSearchCV, the estimator's own score method is used, which for SVR is the R-squared value, not the mean_squared_error you used in your check.
That can be fixed by doing this:
from sklearn.metrics import make_scorer
scorer = make_scorer(mean_squared_error, greater_is_better=False)
svr_gs = GridSearchCV(SVR(epsilon = 0.01), parameters, cv = K, scoring=scorer)
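Note that make_scorer with greater_is_better=False negates the metric so that "higher is better" still holds internally; the values in cv_results_ and best_score_ will therefore be negative MSE. A minimal sketch of how to read them back:
svr_gs.fit(X, y)
print(svr_gs.best_params_)   # parameters with the lowest mean CV MSE
print(-svr_gs.best_score_)   # best_score_ is negated MSE, so flip the sign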
2) The amount of data used by GridSearchCV for training. The grid search splits the data into train and test folds using the cv you provide (in your case K=5, so a 5-fold approach is used). This means the SVR is trained on the training folds and scored on the held-out test fold, not fitted and scored on the whole dataset as you are doing. For K=5, only 80% of the data is used for training at any one time, i.e. less data than in your manual check, which changes the result.
That can be fixed by increasing the value of K to, say, 15, 20, or 25.
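Putting the two fixes together, a minimal sketch (cv=20 is one value from the suggested range; scorer is the one defined above):
svr_gs = GridSearchCV(SVR(epsilon = 0.01), parameters, cv = 20, scoring=scorer)
svr_gs.fit(X, y)
print(svr_gs.best_params_)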
After doing these two changes, this is what I get:
Upvotes: 5