Reputation: 4354
I'm trying to build a model from scratch that can classify MNIST images (handwritten digits). The model needs to output a list of probabilities, one per digit, representing how likely it is that the input image is that digit.
This is the code I have so far:
```python
from sklearn.datasets import load_digits
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

digits = load_digits()
features = digits.data
targets = digits.target

train_count = int(0.8 * len(features))
train_x = features[:train_count]
train_y = targets[:train_count]
test_x = features[train_count:]
test_y = targets[train_count:]

bias = np.random.rand()
weights = np.random.rand(len(features[0]))
rate = 0.02

for i in range(1000):
    for i, sample in enumerate(train_x):
        prod = np.dot(sample, weights) - bias
        soft = softmax(prod)
        predicted = np.argmax(soft) + 1
        error = predicted - train_y[i]
        weights -= error * rate * sample
        bias -= rate * error
        # print(error)
```
I'm trying to build the model so that it uses stochastic gradient descent but I'm a little confused as to what to pass to the softmax function. I understand it's supposed to expect a vector of numbers, but what I'm used to (when building a small NN) is that the model should produce one number, which is passed to an activation function, which in turn produces the prediction. Here, I feel like I'm missing a step and I don't know what it is.
Upvotes: 0
Views: 1438
Reputation: 53758
In the simplest implementation, your last layer (just before softmax) should indeed output a 10-dim vector, which will be squeezed to `[0, 1]` by the softmax. This means that `weights` should be a matrix of shape `[features, 10]` and `bias` should be a `[10]` vector.
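For concreteness, here is a minimal NumPy sketch of those shapes (the small random initialization and zero bias are just illustrative choices; 64 is the number of features per `load_digits` image):

```python
import numpy as np

n_features = 64                          # load_digits images are 8x8 = 64 pixels
n_classes = 10

weights = 0.01 * np.random.rand(n_features, n_classes)   # a matrix, not a vector
bias = np.zeros(n_classes)                                # one bias per class

sample = np.random.rand(n_features)              # stand-in for one training image
logits = np.dot(sample, weights) + bias          # 10 scores, one per digit
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax over the 10 scores
predicted = np.argmax(probs)                     # digits are 0-9, so no "+ 1" needed
```

This is the step your current code is missing: the dot product already has to produce 10 numbers before passing them to softmax makes sense.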
In addition to this, you should one-hot encode your `train_y` labels, i.e. convert each item to a `[0, 0, ..., 1, ..., 0]` vector. The shape of `train_y` is thus `[size, 10]`.
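With NumPy the encoding can be a one-liner (assuming `train_y` holds the integer labels 0-9, as it does with `load_digits`):

```python
train_y_onehot = np.eye(10)[train_y]   # shape (size, 10), a single 1 per row
```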
Take a look at a logistic regression example - it's in TensorFlow, but the model is likely to be similar to yours: they use 784 features (all 28x28 pixels of the full MNIST images, versus the 64 in `load_digits`), one-hot encoding for labels and a single linear layer followed by softmax. They also use mini-batches to speed up learning.
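Putting the pieces together, a minimal sketch of how your training loop might look with these changes is below. It keeps your learning rate and epoch count, and uses the standard softmax + cross-entropy gradient (`probs - one_hot_label`) in place of the `predicted - train_y[i]` error in your current code:

```python
from sklearn.datasets import load_digits
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))            # shift for numerical stability
    return e / np.sum(e)

digits = load_digits()
features = digits.data                   # shape (1797, 64)
targets = np.eye(10)[digits.target]      # one-hot labels, shape (1797, 10)

train_count = int(0.8 * len(features))
train_x, train_y = features[:train_count], targets[:train_count]
test_x, test_y = features[train_count:], targets[train_count:]

weights = 0.01 * np.random.rand(features.shape[1], 10)   # (64, 10) matrix
bias = np.zeros(10)
rate = 0.02

for epoch in range(1000):
    for sample, label in zip(train_x, train_y):
        logits = np.dot(sample, weights) + bias    # (10,) scores
        probs = softmax(logits)                    # (10,) probabilities
        grad = probs - label                       # cross-entropy gradient w.r.t. logits
        weights -= rate * np.outer(sample, grad)   # (64, 10) update
        bias -= rate * grad
```

Note that raw `load_digits` pixel values range from 0 to 16, so with this learning rate the loop may be unstable; scaling the inputs (e.g. dividing by 16) or lowering `rate` is worth trying.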
Upvotes: 1