Reputation: 1114
I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I have followed the same logic using scikit-learn and MLlib, on the exact same dataset. For scikit-learn I have the following code:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)
print "supposed to be 1"
print svc_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]])
print svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]])
print svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]])
print svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]])
print "supposed to be 0"
print svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]])
print svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]])
print svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]])
print svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]])
and it returns:
supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]
For Spark I am doing:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

model_svm = SVMWithSGD.train(trainingData, iterations=100)
print "supposed to be 1"
print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0))
print "supposed to be 0"
print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0))
which returns:
supposed to be 1
1
1
1
1
supposed to be 0
1
1
1
1
I have tried to keep my positive and negative classes balanced; my test data contains 3521 records and my training data 8356 records. For the evaluation, cross-validation applied to the scikit-learn model gives 98% accuracy, while for Spark the area under ROC is 0.5, the area under PR is 0.74, and the training error is 0.47 (computed as sketched below).
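For reference, the Spark metrics above were computed roughly like this (a sketch; testData is assumed to be an RDD of LabeledPoint):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Pair each prediction with its true label, then score the pairs.
score_and_labels = testData.map(
    lambda p: (float(model_svm.predict(p.features)), p.label))
metrics = BinaryClassificationMetrics(score_and_labels)
print metrics.areaUnderROC  # 0.5
print metrics.areaUnderPR   # 0.74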
I have also tried clearing the threshold and setting it back to 0.5, but this did not return any better results (sketched below). Sometimes, when I change the train-test split, I get e.g. all zeros except for one correct prediction, or all ones except for one correct zero prediction. Does anyone know how to approach this problem?
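The threshold calls I tried looked roughly like this (a sketch):

model_svm.clearThreshold()   # predict() now returns raw scores
model_svm.setThreshold(0.5)  # then set it back to 0.5, as described above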
As I said, I have checked multiple times that my dataset is exactly the same in both cases.
Upvotes: 1
Views: 1517
Reputation: 22238
You're using different classifiers, so you're getting different results. Sklearn's SVC is an SVM with an RBF kernel (the default); SVMWithSGD is a linear SVM trained with SGD. They are totally different.
If you want to match the results, then I think the way to go is to use sklearn.linear_model.SGDClassifier(loss='hinge') and try to match the other parameters (regularization, whether to fit an intercept, etc.), because the defaults are not the same.
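For example, a minimal sketch (the parameter mapping is approximate: MLlib's SVMWithSGD defaults to L2 regularization with regParam=0.01 and no intercept, which roughly corresponds to the alpha and fit_intercept flags below):

from sklearn.linear_model import SGDClassifier

# Linear SVM (hinge loss) trained with SGD, roughly mirroring
# SVMWithSGD's defaults: L2 penalty, alpha ~ regParam=0.01, no intercept.
sgd_model = SGDClassifier(loss='hinge', penalty='l2', alpha=0.01,
                          fit_intercept=False)
sgd_model.fit(X_train, y_train)
print sgd_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]])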
Upvotes: 3
Reputation: 36545
Your call to clearThreshold() is causing the classifier to return raw prediction scores. From the pyspark documentation:

clearThreshold()
Note: Experimental. Clears the threshold so that predict will output raw prediction scores. It is used for binary classification only.
New in version 1.4.0.

If you just want the predicted class, remove this function call.
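A minimal sketch of the difference, reusing one of the question's vectors (note that an SVM's raw score is a margin centered at 0, so the model's default threshold is 0.0, not 0.5):

from pyspark.mllib.linalg import Vectors

x = Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)

model_svm.clearThreshold()   # predict() now returns the raw margin (a float)
print model_svm.predict(x)

model_svm.setThreshold(0.0)  # restore the default decision boundary
print model_svm.predict(x)   # back to a 0/1 label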
Upvotes: 1