wmac

Reputation: 1063

SKLearn Multiclass Classifier

I have written the following code to import data vectors from a file and test the performance of an SVM classifier (using sklearn and Python).

However, the classifier's performance is lower than that of any other classifier I've tried (a neural network, for example, gives 98% accuracy on the test data, but this gives 92% at best). In my experience, SVM should produce better results for this kind of data.

Am I possibly doing something wrong?

import numpy as np

def buildData(featureCols, testRatio):
    data = np.loadtxt("car-eval-data-1.csv", delimiter=',')

    X = data[:, :featureCols]  # select columns 0:featureCols-1
    y = data[:, featureCols]   # select column  featureCols 

    n_points = y.size
    print("Imported " + str(n_points) + " lines.")

    ### split into train/test sets
    split = int((1-testRatio) * n_points)
    X_train = X[0:split,:]
    X_test  = X[split:,:]
    y_train = y[0:split]
    y_test  = y[split:]

    return X_train, y_train, X_test, y_test

def buildClassifier(features_train, labels_train):
    from sklearn import svm

    #clf = svm.SVC(kernel='linear', C=1.0, gamma=0.1)
    #clf = svm.SVC(kernel='poly', degree=3, C=1.0, gamma=0.1)
    clf = svm.SVC(kernel='rbf', C=1.0, gamma=0.1)
    clf.fit(features_train, labels_train)
    return clf

def checkAccuracy(clf, features, labels):
    from sklearn.metrics import accuracy_score

    pred = clf.predict(features)
    accuracy = accuracy_score(labels, pred)  # signature is (y_true, y_pred)
    return accuracy

features_train, labels_train, features_test, labels_test = buildData(6, 0.3)
clf           = buildClassifier(features_train, labels_train)
trainAccuracy = checkAccuracy(clf, features_train, labels_train)
testAccuracy  = checkAccuracy(clf, features_test, labels_test)
print("Training Items: " + str(labels_train.size) + ", Test Items: " + str(labels_test.size))
print("Training Accuracy: " + str(trainAccuracy))
print("Test Accuracy: " + str(testAccuracy))

for i in range(labels_test.size):
    # newer sklearn versions expect a 2-D array, so reshape the single sample
    pred = clf.predict(features_test[i].reshape(1, -1))
    print("F(" + str(i) + ") : " + str(features_test[i])
          + " label= " + str(labels_test[i]) + " pred= " + str(pred))
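(One more thing I have checked, in case it matters: RBF-kernel SVMs are sensitive to feature scale, so columns on very different scales can hurt accuracy. A minimal numpy sketch of standardization, with made-up arrays standing in for my data; the key point is that the test set must be scaled with the *training* statistics:)

```python
import numpy as np

# Hypothetical arrays standing in for features_train / features_test
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test  = np.array([[1.5, 300.0]])

mu = X_train.mean(axis=0)          # statistics from the training set only
sigma = X_train.std(axis=0)
sigma[sigma == 0] = 1.0            # guard against constant columns

X_train_s = (X_train - mu) / sigma
X_test_s  = (X_test - mu) / sigma  # reuse the training statistics
```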

Also, how can multi-class classification be done here if the classifier does not do it by default?

P.S. My data is in the following format (the last column is the class):

2,2,2,2,2,1,0
2,2,2,2,1,2,0
0,2,2,5,2,2,3
2,2,2,4,2,2,1
2,2,2,4,2,0,0
2,2,2,4,2,1,1
2,2,2,4,1,2,1
0,2,2,5,2,2,3

Upvotes: 0

Views: 1326

Answers (1)

wmac

Reputation: 1063

I found the problem after a long time and I am posting it in case someone else needs it.

The problem was that the data-import function didn't shuffle the data. If the data is sorted in some way, there is a risk of training the classifier on one part of the data and testing it on totally different data. In the NNet case, MATLAB was used, and it shuffles the input data automatically.

def buildData(filename, featureCols, testRatio):
    data = np.loadtxt(filename, delimiter=',')
    np.random.shuffle(data)    # randomize the order

    X = data[:, :featureCols]  # select columns 0:featureCols-1
    y = data[:, featureCols]   # select column  featureCols

    n_points = y.size
    print("Imported " + str(n_points) + " lines.")

    ### split into train/test sets
    split = int((1 - testRatio) * n_points)
    X_train = X[0:split, :]
    X_test  = X[split:, :]
    y_train = y[0:split]
    y_test  = y[split:]

    return X_train, y_train, X_test, y_test
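One related pitfall worth noting: `np.random.shuffle(data)` works here only because the features and the label travel together in a single array. If X and y were kept in separate arrays, shuffling each one independently would destroy the feature/label pairing; the fix is one shared permutation index applied to both. A minimal numpy sketch with made-up arrays:

```python
import numpy as np

# Toy stand-in arrays; in the post, X and y come from car-eval-data-1.csv
y = np.arange(10)
X = np.column_stack([2 * y, 3 * y])   # row i is (2*i, 3*i), so rows track labels

rng = np.random.default_rng(0)
perm = rng.permutation(len(y))        # ONE permutation, applied to both arrays
X, y = X[perm], y[perm]

split = int(0.7 * len(y))             # same 70/30 split as testRatio=0.3
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```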

Upvotes: 2
