Jan Erst

Reputation: 89

Probabilistic SVM, regression

I've currently implemented a probabilistic SVM (at least I think so) for binary classes. Now I want to extend this approach to regression, and I'm trying to use it on the Boston dataset. Unfortunately, it seems like my algorithm is stuck; the code I'm currently running looks like this:

from sklearn import decomposition
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

boston = load_boston()

X = boston.data
y = boston.target
inputs_train, inputs_test, targets_train, targets_test = train_test_split(X, y, test_size=0.33, random_state=42)

def plotting():
    param_C = [0.01, 0.1]
    param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
    clf = GridSearchCV(svm.SVR(), cv = 5, param_grid= param_grid)
    clf.fit(inputs_train, targets_train)
    clf = SVR(C=clf.best_params_['C'], cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=5, gamma=clf.best_params_['gamma'],
              kernel=clf.best_params_['kernel'],
              max_iter=-1, probability=True, random_state=None, shrinking=True,
              tol=0.001, verbose=False)
    clf.fit(inputs_train, targets_train)
    a = clf.predict(inputs_test[0])
    print(a)


plotting()

Can someone tell me what is wrong with this approach? It's not that I get some error message (I know, I've suppressed them above), but the code never stops running. Any suggestions are hugely appreciated.

Upvotes: 2

Views: 5325

Answers (1)

desertnaut

Reputation: 60388

There are several issues with your code.

  • To start with, what is taking forever is the first clf.fit (i.e. the grid search one), and that's why you didn't see any change when you set max_iter and tol in your second clf.fit.

  • Second, the clf=SVR() part will not work, because:

    • You have to import it; SVR is not recognized otherwise
    • You have a bunch of illegal arguments in there (decision_function_shape, probability, random_state etc) - check the docs for the admissible SVR arguments.
  • Third, you don't need to explicitly fit again with the best parameters; you should simply ask for refit=True in your GridSearchCV definition and subsequently use clf.best_estimator_ for your predictions (EDIT after comment: simply clf.predict will also work).

So, moving the stuff outside of any function definition, here is a working version of your code:

from sklearn.svm import SVR
# other imports as-is

# data loading & splitting as-is

param_C = [0.01, 0.1]
param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
clf = GridSearchCV(SVR(degree=5, max_iter=10000), cv=5, param_grid=param_grid, refit=True)
clf.fit(inputs_train, targets_train)
a = clf.best_estimator_.predict(inputs_test[0])
# a = clf.predict(inputs_test[0]) will also work 
print(a)
# [ 21.89849792]

Apart from degree, all the other admissible argument values you are using are actually the respective default values, so the only arguments you really need in your SVR definition are degree and max_iter.
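(If you want to double-check which values are defaults, every scikit-learn estimator exposes them via get_params; this one-liner is just a quick way to inspect them, not part of the original answer:)

from sklearn.svm import SVR
print(SVR().get_params())   # shows the default values for C, gamma, kernel, degree, etc.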

You'll get a couple of warnings (not errors), i.e. after fitting:

/databricks/python/lib/python3.5/site-packages/sklearn/svm/base.py:220: ConvergenceWarning: Solver terminated early (max_iter=10000). Consider pre-processing your data with StandardScaler or MinMaxScaler.

and after predicting:

/databricks/python/lib/python3.5/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

which already contain some advice for what to do next...
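If you want to act on that advice, here is a minimal sketch (assuming the same imports and data split as above; wrapping the estimator in a Pipeline and the resulting 'svr__' parameter prefixes are my addition, not part of the original code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# scale the features, as suggested by the ConvergenceWarning
pipe = Pipeline([('scaler', StandardScaler()),
                 ('svr', SVR(degree=5, max_iter=10000))])

# parameter names get the 'svr__' prefix because the estimator lives inside a Pipeline
param_grid = {'svr__C': [0.01, 0.1],
              'svr__kernel': ['poly', 'rbf'],
              'svr__gamma': [0.1, 0.01]}

clf = GridSearchCV(pipe, cv=5, param_grid=param_grid, refit=True)
clf.fit(inputs_train, targets_train)

# reshape(1, -1) gives a 2-D array of shape (1, n_features), as the DeprecationWarning asks
a = clf.predict(inputs_test[0].reshape(1, -1))
print(a)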

Last but not least: a probabilistic classifier (i.e. one that produces probabilities instead of hard labels) is a valid thing, but a "probabilistic" regression model is not...
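A quick illustration of that distinction (not part of the original answer; the threshold of 21 used to binarize the Boston target below is arbitrary, just to get two classes):

from sklearn.svm import SVC, SVR

# a classifier can be probabilistic: SVC exposes predict_proba when probability=True
clf = SVC(probability=True)
clf.fit(inputs_train, (targets_train > 21).astype(int))  # arbitrary binarization
print(clf.predict_proba(inputs_test[:1]))   # class probabilities for one test sample

# a regressor only returns point predictions - there is no predict_proba on SVR
reg = SVR()
reg.fit(inputs_train, targets_train)
print(hasattr(reg, 'predict_proba'))        # False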

Tested with Python 3.5 and scikit-learn 0.18.1

Upvotes: 3
