SKMohammadi

Reputation: 200

Different accuracy for python (Scikit-Learn) and R (e1071)

For the same dataset (here, Bupa) and the same parameters, I get different accuracies.

What did I overlook?

R implementation:

library(e1071)

data_file <- "bupa.data"
dataset <- read.csv(data_file, header = FALSE)
nobs <- nrow(dataset)  # 345 observations in the Bupa dataset
train <- sample(nrow(dataset), 0.95 * nobs)  # 327 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1 * nobs)
test <- setdiff(seq_len(nrow(dataset)), train)  # 18 observations
svmfit <- svm(V7 ~ ., data = dataset[train, ],
              type = "C-classification",
              kernel = "linear",
              cost = 1,
              cross = 10)
testpr <- predict(svmfit, newdata = na.omit(dataset[test, ]))
accuracy <- sum(testpr == na.omit(dataset[test, ])$V7) / length(na.omit(dataset[test, ])$V7)

I get accuracy: 0.94

But when I do the following in Python (scikit-learn):

import numpy as np
from sklearn import cross_validation
from sklearn import svm

dataset = np.loadtxt(fname="data/bupa.data", delimiter=',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:, 6]
X = dataset[:, :-1]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.06, random_state=0)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy')

I get accuracy: 0.67

Please help me.

Upvotes: 4

Views: 1761

Answers (3)

David

Reputation: 1999

When using Support Vector Regression in Python/sklearn and R/e1071, both the x and y variables need to be scaled the same way (and the predictions unscaled afterwards). Here is a self-contained example using rpy2 to show the equivalence of R and Python results (first part with scaling disabled in R, second part with 'manual' scaling in Python):

# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.preprocessing
import sklearn.svm
import rpy2
import rpy2.robjects
import rpy2.robjects.packages

# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):

    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow = len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow = len(x_test))

    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
                     y=ry_train,
                     kernel='radial',
                     cost=cost,
                     epsilon=epsilon,
                     gamma=gamma,
                     scale=scale)

    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))

    return ry_pred

# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.title(title)
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    plt.xlabel('observed')
    plt.ylabel('predicted')
    plt.legend(loc=0)
    return None

# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)

# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)

# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01

# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()

# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
               cost=C, epsilon=epsilon, gamma=gamma, scale=False)

# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]

# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]

# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
               cost=C, epsilon=epsilon, gamma=gamma, scale=True)

# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
plt.show()

Observed vs. predicted values in Python (black) and R (orange)

UPDATE: Actually, R and Python use slightly different definitions of the variance when scaling, see this answer (1/(N-1) in R vs. 1/N in Python, where N is the sample size). However, for typical sample sizes, this difference should be negligible.
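A quick sketch of that difference (the numbers are illustrative; numpy's ddof argument switches between the two definitions):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)

# sklearn's StandardScaler divides by N (population standard deviation)
print(StandardScaler().fit(x).scale_[0])  # ~1.118, equals x.std(ddof=0)

# R's scale() and sd() divide by N - 1 (sample standard deviation)
print(x.std(ddof=1))                      # ~1.291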

Upvotes: 1

rafalbachorz

Reputation: 1

I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular, I did the following:

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)

where X is my training set. Then, when preparing the test set, I simply reused the StandardScaler instance fitted on the training set. It is important to use it just for transforming, not for fitting and transforming (as above), i.e.:

X_test = sc_X.transform(X_test)

This yielded substantial agreement between the R and scikit-learn results.
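As a minimal sketch (my addition, assuming the X_train/X_test split from the question), sklearn's Pipeline bundles both steps so the test set can never accidentally be refit:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# the scaler is fit on the training data only; score()/predict() then
# transform the test data with the stored parameters instead of refitting
model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)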

Upvotes: 0

learner

Reputation: 1985

I came across this post having the same issue: wildly different accuracy between scikit-learn and the e1071 bindings for libSVM. I think the issue is that e1071 scales the training data by default and then keeps the scaling parameters for use when predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling approach needs to be taken on both training and test data. I only thought to check this after encountering and reading this guide from the nice people behind libSVM.

While I don't have your data, str(svmfit) should give you the scaling parameters (mean and standard deviation of the columns of Bupa). You can use these to appropriately scale your data in Python (see below for an idea). Alternatively, you can scale the entire dataset together in Python and then do the train/test split; either way should now give you identical predictions.

def manual_scale(a, means, sds):
    # center by the training means, then divide by the training
    # standard deviations (the parameters e1071 stored at fit time)
    return (a - means) / sds
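For example, a hypothetical usage (the centers and scales below are placeholders; the real per-column values are stored by e1071 when scale = TRUE and reported by str(svmfit)):

import numpy as np

# placeholder values for illustration only; substitute the actual
# training-set means and standard deviations from the fitted R model
means = np.array([90.2, 69.8, 30.4, 24.6, 38.3, 3.5])
sds = np.array([4.4, 18.3, 19.5, 10.1, 39.3, 3.3])

# apply the same training-set parameters to both splits
X_train_scaled = manual_scale(X_train, means, sds)
X_test_scaled = manual_scale(X_test, means, sds)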

Upvotes: 4
