Reputation: 1114
I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I have followed the same logic using scikit-learn and MLlib, on the exact same dataset. For scikit-learn I have the following code:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)
print "supposed to be 1"
print svc_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]])
print svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]])
print svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]])
print svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]])
print "supposed to be 0"
print svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]])
print svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]])
print svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]])
print svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]])
and it returns:
supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]
For Spark I am doing:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

model_svm = SVMWithSGD.train(trainingData, iterations=100)
print "supposed to be 1"
print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0))
print "supposed to be 0"
print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0))
which returns:
supposed to be 1
1
1
1
1
supposed to be 0
1
1
1
1
I have tried to keep my positive and negative classes balanced; my test data contains 3521 records and my training data 8356 records. For the evaluation, cross-validation applied to the scikit-learn model gives 98% accuracy, while for Spark the area under ROC is 0.5, the area under PR is 0.74, and the training error is 0.47 (computed as sketched below).
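For reference, the Spark metrics above were computed roughly like this (a sketch; testData is assumed to be an RDD of LabeledPoint):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Pair each prediction with its true label, then score the pairs.
score_and_labels = testData.map(
    lambda p: (float(model_svm.predict(p.features)), p.label))
metrics = BinaryClassificationMetrics(score_and_labels)
print metrics.areaUnderROC  # 0.5
print metrics.areaUnderPR   # 0.74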
I have also tried clearing the threshold and setting it back to 0.5, but this did not return any better results (sketched below). Sometimes, when I change the train-test split, I get e.g. all zeros except for one correct prediction, or all ones except for one correct zero prediction. Does anyone know how to approach this problem?
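The threshold calls I tried looked roughly like this (a sketch):

model_svm.clearThreshold()   # predict() now returns raw scores
model_svm.setThreshold(0.5)  # then set it back to 0.5, as described above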
As I said, I have checked multiple times that my dataset is exactly the same in both cases.
Upvotes: 1
Views: 1517
Reputation: 22238
You're using different classifiers, so you're getting different results. Sklearn's SVC is an SVM with an RBF kernel (the default); SVMWithSGD is a linear SVM trained with SGD. They are totally different.
If you want to match the results, then I think the way to go is to use sklearn.linear_model.SGDClassifier(loss='hinge') and try to match the other parameters (regularization, whether to fit an intercept, etc.), because the defaults are not the same.
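For example, a minimal sketch (the parameter mapping is approximate: MLlib's SVMWithSGD defaults to L2 regularization with regParam=0.01 and no intercept, which roughly corresponds to the alpha and fit_intercept flags below):

from sklearn.linear_model import SGDClassifier

# Linear SVM (hinge loss) trained with SGD, roughly mirroring
# SVMWithSGD's defaults: L2 penalty, alpha ~ regParam=0.01, no intercept.
sgd_model = SGDClassifier(loss='hinge', penalty='l2', alpha=0.01,
                          fit_intercept=False)
sgd_model.fit(X_train, y_train)
print sgd_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]])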
Upvotes: 3
Reputation: 36545
Your call to clearThreshold() is causing the classifier to return raw prediction scores. From the pyspark documentation:

clearThreshold()
Note: Experimental. Clears the threshold so that predict will output raw prediction scores. It is used for binary classification only.
New in version 1.4.0.

If you just want the predicted class, remove this function call.
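A minimal sketch of the difference, reusing one of the question's vectors (note that an SVM's raw score is a margin centered at 0, so the model's default threshold is 0.0, not 0.5):

from pyspark.mllib.linalg import Vectors

x = Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)

model_svm.clearThreshold()   # predict() now returns the raw margin (a float)
print model_svm.predict(x)

model_svm.setThreshold(0.0)  # restore the default decision boundary
print model_svm.predict(x)   # back to a 0/1 label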
Upvotes: 1