linthum

Reputation: 363

Sklearn digits dataset

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()

print(digits.data)

classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.data[:-1], digits.target[:-1]

x = x.reshape(1,-1)
y = y.reshape(-1,1)
print(x)

classifier.fit(x, y)
###
print('Prediction:', classifier.predict(digits.data[-3]))
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

I have reshaped x and y as well, but I'm still getting this error:

Found input variables with inconsistent numbers of samples: [1, 1796]

y is a 1-d array with 1796 elements, while x holds far more values than that. Why does the error report only 1 sample for x?

Upvotes: 2

Views: 5761

Answers (2)

DJanssens
DJanssens

Reputation: 20709

The reshaping transforms each 8x8 image matrix into a 1-dimensional vector that can be used as a feature vector. You need to reshape the entire x array, not just the training samples, since the ones you use for prediction must have the same format.

The following code shows how:

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()


classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.images, digits.target

# Only reshape x: each image is an 8x8 matrix and needs to be flattened
n_samples = len(digits.images)
x = x.reshape((n_samples, -1))
print("before reshape:" + str(digits.images[0]))
print("After reshape" + str(x[0]))


classifier.fit(x[:-2], y[:-2])
###
print('Prediction:', classifier.predict(x[-2].reshape(1, -1)))  # predict expects a 2-d array
###
plt.imshow(digits.images[-2], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

###
print('Prediction:', classifier.predict(x[-1].reshape(1, -1)))  # predict expects a 2-d array
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

It will output:

before reshape:[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
After reshape[  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
   0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
   8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
  12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
  10.   0.   0.   0.]

And it gives a correct prediction for the last 2 images, which weren't used for training. You can, of course, decide to make a bigger split between the training and test sets.
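For example, here is a sketch of a bigger split using train_test_split from sklearn.model_selection (gamma=0.001 here follows the scikit-learn digits example rather than the question's 0.4):

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Flatten each 8x8 image into a 64-element row
x = digits.images.reshape((len(digits.images), -1))

# Hold out 25% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, digits.target, test_size=0.25, random_state=0)

classifier = svm.SVC(gamma=0.001, C=100)
classifier.fit(x_train, y_train)
print('Test accuracy:', classifier.score(x_test, y_test))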

Upvotes: 0

SCB

Reputation: 6139

Actually, scrap what I suggested below:

The scikit-learn dataset API documentation describes the general layout. The attribute data is a 2-d array, with each image already flattened into a row:

import sklearn.datasets
digits = sklearn.datasets.load_digits()
digits.data.shape
#: (1797, 64)

This is all you need to provide; no reshaping required. Similarly, the attribute target is a 1-d array of labels:

digits.target.shape
#: (1797,)

No reshaping necessary. Just split into training and testing and run with it.
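For instance, a minimal sketch (the 10-sample holdout is an arbitrary choice, and gamma=0.001 follows the scikit-learn docs rather than the question's 0.4):

from sklearn import datasets, svm

digits = datasets.load_digits()
classifier = svm.SVC(gamma=0.001, C=100)

# digits.data is already (n_samples, 64), so it can be fed to fit directly
classifier.fit(digits.data[:-10], digits.target[:-10])

print('Predictions:', classifier.predict(digits.data[-10:]))
print('True labels:', digits.target[-10:])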


Try printing x.shape and y.shape. I suspect you'll find something like (1, ...) and (1796, 1) respectively. When you call fit, scikit-learn classifiers expect x and y to agree on the number of samples, i.e. on the first dimension, which is where the [1, 1796] in the error comes from.
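For example, a quick check (a sketch, assuming the same slicing as in the question):

from sklearn import datasets

digits = datasets.load_digits()
x, y = digits.data[:-1], digits.target[:-1]   # 1796 samples each

print(x.reshape(1, -1).shape)   # (1, 114944): one giant "sample"
print(y.reshape(-1, 1).shape)   # (1796, 1): 1796 samples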

The clue: why are the reshape arguments in the opposite order for x and y?

x = x.reshape(1, -1)
y = y.reshape(-1, 1)

Maybe try:

x = x.reshape(-1, 1)

Completely unrelated to your question, but you're predicting on digits.data[-3] when the only element left out of the training set is digits.data[-1]. Not sure if that was intentional.

Regardless, it would be good to check your classifier over more results using the scikit-learn metrics package; the scikit-learn docs have an example of using it on the digits dataset.
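A sketch along those lines (modelled on the scikit-learn digits example; the 50/50 split and gamma=0.001 are that example's choices):

from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

classifier = svm.SVC(gamma=0.001)
predicted = classifier.fit(x_train, y_train).predict(x_test)

# Per-class precision/recall plus the confusion matrix
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))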

Upvotes: 1
