Reputation: 479
I am using the one-class SVM classifier OneClassSVM
from scikit-learn to detect outliers in a dataset. The dataset has 30000 samples with 1024 variables, and I use 10 percent of them as training data.
clf = svm.OneClassSVM(nu=0.001, kernel="rbf", gamma=1e-5)
clf.fit(trset)
dist2hptr = clf.decision_function(trset)
tr_y = clf.predict(trset)
As shown above, I calculate the distance of each sample to the decision boundary using the decision_function(x)
method. When I compare the prediction results with the distances, samples marked +1 in the predict output always have a positive distance, and samples marked -1 always have a negative one.
I thought a distance doesn't have a sign, since it does not deal with direction. I want to understand how the distances are calculated in the scikit-learn OneClassSVM
classifier. Does the sign simply indicate whether the sample lies outside the decision hyperplane calculated by the SVM?
Please help.
Upvotes: 0
Views: 692
Reputation: 33532
sklearn's OneClassSVM is an implementation of the following paper, as explained here:
Bernhard Schölkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 13, 7 (July 2001), 1443-1471. DOI: https://doi.org/10.1162/089976601750264965
Let's have a look at the abstract of that paper here:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
So the abstract defines the function f that sklearn's OneClassSVM estimates: f is positive on the estimated support S (inliers) and negative on its complement (outliers). The value returned by decision_function is this signed distance to the separating hyperplane, and predict is simply its sign, so the sign tells you on which side of the learned boundary a sample lies, while the magnitude tells you how far from the boundary it is.
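A minimal sketch illustrating this (with synthetic data standing in for your trset, and hyperparameters chosen arbitrarily for the toy data):

```python
# Sketch, not the asker's exact setup: synthetic data instead of trset.
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = rng.randn(300, 4)  # toy training set

clf = svm.OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1)
clf.fit(X)

dist = clf.decision_function(X).ravel()  # signed distance to the hyperplane
pred = clf.predict(X)                    # +1 = inside S (inlier), -1 = outlier

# Away from the exact boundary, the sign of the distance matches the label
assert np.all(pred[dist > 1e-9] == 1)
assert np.all(pred[dist < -1e-9] == -1)
print("signs agree with predictions")
```

So the sign is not an error: it encodes which side of the boundary the sample falls on, exactly as the paper's f is defined.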
Upvotes: 4