Reputation: 3196
In the example above, I'm using my dataset to identify outliers. Slight changes to the nu parameter lead to a huge difference in the number of anomalies identified.
Could this be just a particularity of the dataset? Or a bug in scikit-learn?
P.S. Unfortunately I cannot share the dataset.
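Since the dataset can't be shared, here is a hypothetical reproduction on random data (a stand-in, not the original data) showing the same kind of jump:
import numpy as np
from sklearn.svm import OneClassSVM

# Random stand-in data; the real dataset is not available
rng = np.random.RandomState(0)
X = rng.rand(100, 1)

# Two nearby nu values can flag very different numbers of anomalies
for nu in (0.01, 0.02):
    n_anomalies = (OneClassSVM(nu=nu).fit_predict(X) == -1).sum()
    print(f"nu={nu}: {n_anomalies} anomalies")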
Upvotes: 2
Views: 607
Reputation: 21
If you decrease the value of the tol parameter of OneClassSVM, the result is better, although still not entirely as expected for low values of nu.
import numpy as np
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt

# 100 uniform random points in 1D
X = np.random.rand(100, 1)

# Sweep nu on a log scale and record the fraction of points flagged as outliers
nus = np.geomspace(0.0001, 0.5, num=100)
outlier_fraction = np.zeros(len(nus))
for i, nu in enumerate(nus):
    # Tight tol so the solver converges far enough for the
    # outlier fraction to track nu
    outlier_fraction[i] = (OneClassSVM(nu=nu, tol=1e-12).fit_predict(X) == -1).mean()

plt.plot(nus, outlier_fraction)
plt.xlabel('nu')
plt.ylabel('Outlier fraction')
plt.show()
With the default tol you obtain a much noisier curve, and the outlier fraction no longer tracks nu.
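A plausible explanation (my reading, not an official one): for one-class SVMs, nu should upper-bound the fraction of training points labelled outliers and lower-bound the fraction of support vectors (the nu-property), and these bounds only hold once the solver has converged tightly, so a loose tol lets the outlier fraction drift away from nu. A quick sanity check of the bounds, assuming a tight tolerance is enough for convergence:
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.rand(1000, 1)
nu = 0.1
clf = OneClassSVM(nu=nu, tol=1e-12).fit(X)

outlier_frac = (clf.predict(X) == -1).mean()  # expected <= nu, roughly equal to it
sv_frac = len(clf.support_) / len(X)          # expected >= nu
print(outlier_frac, nu, sv_frac)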
Upvotes: 2
Reputation: 1618
NOTE: not an answer. Offering an MCVE.
I also recently came across this. I would like to understand the inflection point at low values of nu.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM

# 100 uniform random points in 1D
X = np.random.rand(100, 1)

# Sweep nu on a log scale and count the anomalies flagged at each value
nu = np.geomspace(0.0001, 1, num=100)
df = pd.DataFrame(data={'nu': nu})
for i in range(len(nu)):  # was range(0, len(X)): iterate over the nu grid, not the data
    df.loc[i, 'anom_count'] = (OneClassSVM(nu=df.loc[i, 'nu']).fit_predict(X) == -1).sum()

df.set_index('nu').plot()
df.set_index('nu').plot(xlim=(0, 0.2))
plt.show()

df.anom_count.min()                      # 3
df.anom_count.idxmin()                   # 62
df.loc[df.anom_count.idxmin(), 'nu']     # 0.031
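Building on the tol answer above, a short comparison (a sketch on the same kind of random data, not the original dataset) suggests the inflection is at least partly a convergence artifact rather than a property of the data:
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.rand(100, 1)

# Compare the default tol against a tight tol at a few low nu values
for nu in (0.001, 0.01, 0.05):
    default_count = (OneClassSVM(nu=nu).fit_predict(X) == -1).sum()
    tight_count = (OneClassSVM(nu=nu, tol=1e-12).fit_predict(X) == -1).sum()
    print(nu, default_count, tight_count)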
Upvotes: 1