ValueError while running SVM in scikit-learn

Question

I am running into a problem fitting a support vector machine on NumPy arrays.

import numpy as np
from sklearn import svm

I have 3 classes/labels (male, female, na), denoted as follows:

labels = [0,1,2]

Each class was defined by 3 variables (height, weight, age) as the training data:

male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])

males = np.hstack([male_height,male_weight,male_age])

female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])

females = np.hstack([female_height,female_weight,female_age])

na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])

nas = np.hstack([na_height,na_weight,na_age])

Now I want to fit a support vector machine to the training data and predict the class of a new sample from those three variables:

height_weight_age = [100,100,100]

clf = svm.SVC()
trainingData = np.vstack([males,females,nas])

clf.fit(trainingData, labels)

result = clf.predict(height_weight_age)

print result

Unfortunately, the following error occurs:

  ValueError: X.shape[1] = 3 should be equal to 15, the number of features at training time

How should I modify the trainingData and labels to get the correct answer?

Fred Foo · Accepted Answer

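hstack concatenates its inputs end to end, so each of males, females and nas becomes a flat 1-d array of 15 values. You need 2-d arrays of shape (n_samples, n_features), which you can get from vstack followed by a transpose.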

In [7]: males = np.hstack([male_height,male_weight,male_age])

In [8]: males
Out[8]: 
array([111, 121, 137, 143, 157,  60,  70,  88,  99,  75,  41,  32,  73,
        54,  35])

In [9]: np.vstack([male_height,male_weight,male_age])
Out[9]: 
array([[111, 121, 137, 143, 157],
       [ 60,  70,  88,  99,  75],
       [ 41,  32,  73,  54,  35]])

In [10]: np.vstack([male_height,male_weight,male_age]).T
Out[10]: 
array([[111,  60,  41],
       [121,  70,  32],
       [137,  88,  73],
       [143,  99,  54],
       [157,  75,  35]])
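
As a side-by-side check of the shapes involved, here is a minimal sketch using np.column_stack, which treats each 1-d input as a column and so gives the same result as vstack(...).T in one call. The variable name flat is only for illustration and does not appear in the question.

```python
import numpy as np

male_height = np.array([111, 121, 137, 143, 157])
male_weight = np.array([60, 70, 88, 99, 75])
male_age = np.array([41, 32, 73, 54, 35])

# hstack joins the arrays end to end: one flat vector of 15 values.
flat = np.hstack([male_height, male_weight, male_age])

# column_stack puts each 1-d array in its own column:
# 5 samples (rows) by 3 features (columns).
males = np.column_stack([male_height, male_weight, male_age])

print(flat.shape)   # (15,)
print(males.shape)  # (5, 3)
```

Each row of males is now one person (height, weight, age), which is the layout scikit-learn estimators expect.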

You also need to pass a list/array of labels with one entry per sample, rather than just enumerating the labels that exist. After rebuilding males, females and nas with vstack(...).T as above, I can train an SVM and apply it as follows:

In [19]: clf = svm.SVC()

In [20]: y = ["male"] * 5 + ["female"] * 5 + ["na"] * 5

In [21]: X = np.vstack([males, females, nas])

In [22]: clf.fit(X, y)
Out[22]: 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [23]: height_weight_age = [100,100,100]

In [24]: clf.predict(height_weight_age)
Out[24]: 
array(['female'], 
      dtype='|S6')
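
The session above comes from an older scikit-learn release. As a rough sketch of the same steps as a standalone script: recent releases reject a 1-d input to predict, so the query is wrapped as a 2-d array holding one sample. The helper group is a hypothetical convenience for this example, not part of NumPy or scikit-learn.

```python
import numpy as np
from sklearn import svm


def group(height, weight, age):
    """Hypothetical helper: stack three 1-d sequences into (n_samples, 3)."""
    return np.column_stack([height, weight, age])


males = group([111, 121, 137, 143, 157], [60, 70, 88, 99, 75], [41, 32, 73, 54, 35])
females = group([91, 121, 135, 98, 90], [32, 67, 98, 86, 56], [51, 35, 33, 67, 61])
nas = group([96, 127, 145, 99, 91], [42, 97, 78, 76, 86], [56, 35, 49, 64, 66])

# Stack the three groups row-wise: 15 samples, 3 features each.
X = np.vstack([males, females, nas])

# One label per row of X, in the same order as the rows.
y = ["male"] * 5 + ["female"] * 5 + ["na"] * 5

clf = svm.SVC()
clf.fit(X, y)

# predict expects shape (n_samples, n_features), so a single query is (1, 3).
height_weight_age = np.array([[100, 100, 100]])
result = clf.predict(height_weight_age)
print(result)
```

The predicted label can differ from the one in the transcript, since the default SVC parameters have changed between releases.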

(Note that I'm using string labels instead of numeric ones. I'd also advise standardizing the feature values, since they have rather different ranges.)
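
One possible way to standardize, sketched here with StandardScaler and make_pipeline from scikit-learn. The answer does not say which method was intended, so treat this as one option rather than the recommended one; the name model is an assumption for this example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rows are samples (height, weight, age), taken from the question's data.
X = np.array([
    [111, 60, 41], [121, 70, 32], [137, 88, 73], [143, 99, 54], [157, 75, 35],  # male
    [91, 32, 51], [121, 67, 35], [135, 98, 33], [98, 86, 67], [90, 56, 61],     # female
    [96, 42, 56], [127, 97, 35], [145, 78, 49], [99, 76, 64], [91, 86, 66],     # na
])
y = ["male"] * 5 + ["female"] * 5 + ["na"] * 5

# The pipeline learns each feature's mean and standard deviation during fit
# and applies the same scaling automatically before every predict call.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)

prediction = model.predict(np.array([[100, 100, 100]]))
print(prediction)
```

Putting the scaler inside the pipeline means new samples are scaled with the training statistics, so the query does not have to be transformed by hand.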
