Reputation: 89
I've currently implemented a probabilistic (at least I think so) for binary classes. Now I want to extend this approach for regression, and I'm trying to use it for the Boston dataset. Unfortunately, it seems like my algorithm is stuck, the code I'm currently running is looking like this:
from sklearn import decomposition
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
inputs_train, inputs_test, targets_train, targets_test = train_test_split(X, y, test_size=0.33, random_state=42)
def plotting():
param_C = [0.01, 0.1]
param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
clf = GridSearchCV(svm.SVR(), cv = 5, param_grid= param_grid)
clf.fit(inputs_train, targets_train)
clf = SVR(C=clf.best_params_['C'], cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=5, gamma=clf.best_params_['gamma'],
kernel=clf.best_params_['kernel'],
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
clf.fit(inputs_train, targets_train)
a = clf.predict(inputs_test[0])
print(a)
plotting()
Can someone tell me, what is wrong in this approach, It's not the fact that I get some error message (I know, I've suppresed them above), but the code never stops running. Any suggestions is hugely appreciated.
Upvotes: 2
Views: 5325
Reputation: 60388
There are several issues with your code.
To start with, what is taking forever is the first clf.fit
(i.e. the grid search one), and that's why you didn't see any change when you set max_iter
and tol
in your second clf.fit
.
Second, the clf=SVR()
part will not work, because:
SVR
is not recognizable decision_function_shape
, probability
, random_state
etc) - check the docs for the admissible SVR
arguments.Third, you don't need to explicitly fit again with the best parameters; you should simply ask for refit=True
in your GridSearchCV
definition and subsequently use clf.best_estimator_
for your predictions (EDIT after comment: simply clf.predict
will also work).
So, moving the stuff outside of any function definition, here is a working version of your code:
from sklearn.svm import SVR
# other imports as-is
# data loading & splitting as-is
param_C = [0.01, 0.1]
param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
clf = GridSearchCV(SVR(degree=5, max_iter=10000), cv = 5, param_grid= param_grid, refit=True,)
clf.fit(inputs_train, targets_train)
a = clf.best_estimator_.predict(inputs_test[0])
# a = clf.predict(inputs_test[0]) will also work
print(a)
# [ 21.89849792]
Apart from degree
, all the other admissible argument values you are are using are actually the respective default values, so the only arguments you really need in your SVR
definition are degree
and max_iter
.
You'll get a couple of warnings (not errors), i.e. after fitting:
/databricks/python/lib/python3.5/site-packages/sklearn/svm/base.py:220: ConvergenceWarning: Solver terminated early (max_iter=10000). Consider pre-processing your data with StandardScaler or MinMaxScaler.
and after predicting:
/databricks/python/lib/python3.5/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
which already contain some advice for what to do next...
Last but not least: a probabilistic classifier (i.e. one that produces probabilities instead of hard labels) is a valid thing, but a "probabilistic" regression model is not...
Tested with Python 3.5 and scikit-learn 0.18.1
Upvotes: 3