Neural Network on MNIST--Result is not expected

Question

After @IVlad gave me really useful feedback, I tried modifying my code, and the modified part would look like:

syn0 = (2*np.random.random((784,len(train_sample))) - 1)/8
syn1 = (2*np.random.random((len(train_sample),10)) - 1)/8


for i in xrange(10000):
    #forward propagation
    l0=train_sample
    l1=nonlin(np.dot(l0, syn0))
    l2=nonlin(np.dot(l1, syn1))

    #calculate error
    l2_error=train_tag_bool-l2

    if (i% 1000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))
    #apply sigmoid to the error 
    l2_delta = l2_error*nonlin(l2,deriv=True)

    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1,deriv=True)
    #update weights

    syn1 += alpha* (l1.T.dot(l2_delta) - beta*syn1)
    syn0 += alpha* (l0.T.dot(l1_delta) - beta*syn0)

Note that the tags (true labels) are now in a 3000 x 10 matrix: each row is a sample, and the ten columns indicate which digit that sample represents. (This is train_tag_bool. Come to think of it, the values aren't really booleans, so the name is a bit misleading, but I'll keep it for the sake of the discussion.)

In this project I'm using only one hidden layer between the input and output layers, hoping it will be sufficient for the job. I have added a learning rate and weight decay, and made the initial weights a bit smaller.

I used the code from the website when calculating the error rate, which is

np.mean(np.abs(l2_error))

and the result came out to be 0.1. I'm not sure what to take from here.

Also, I looked at the l2 layer (the output layer that gives the prediction), and the values are all extremely small (below 10^-9 for the largest value of each sample, and the smallest can reach 10^-85). This is after only 5 iterations, but I doubt things would be any different had I run it for 1k loops or more. If I return the max of each row, it's always the 9th element (represents digit '9'), which is totally wrong.
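One possible reading of the 0.1: with one-hot targets, every row holds a single 1 and nine 0s, so outputs that are all close to zero already give a mean absolute error of about 0.1 without the network having learned anything. The sketch below uses made-up labels (a stand-in, not the MNIST data) to show this, and also computes accuracy from the row-wise argmax, which is usually the more telling number for classification.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for the real labels: 3000 random digits, one-hot encoded.
labels = rng.randint(0, 10, size=3000)
targets = np.eye(10)[labels]

# A "network" whose outputs are all nearly zero, like the tiny l2 values described above.
outputs = np.full((3000, 10), 1e-9)

# Each row contributes one error of about 1 and nine errors of about 0.
mean_abs_error = np.mean(np.abs(targets - outputs))
print(mean_abs_error)

# Accuracy compares the predicted digit (argmax of each row) with the true digit.
predicted = outputs.argmax(axis=1)
accuracy = np.mean(predicted == labels)
print(accuracy)
```

So a mean absolute error near 0.1 is consistent with a network that outputs nothing useful; tracking accuracy alongside it makes that visible.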

I'm stuck again on this problem. Overflow is, and has been, the biggest challenge of my whole ML experience (back then in MATLAB, not NumPy), and I've yet to find a way to deal with it.

train_tag_bool code:

train_tag_bool=np.array([[0]*10]*len(train_tag)).astype('float64')
for i in range(len(train_tag)):
    if train_tag[i]==0:
        train_tag_bool[i][0]=1
    elif train_tag[i]==1:
        train_tag_bool[i][1]=1
    elif train_tag[i]==2:
        train_tag_bool[i][2]=1
    elif train_tag[i]==3:
        train_tag_bool[i][3]=1
    elif train_tag[i]==4:
        train_tag_bool[i][4]=1
    elif train_tag[i]==5:
        train_tag_bool[i][5]=1
    elif train_tag[i]==6:
        train_tag_bool[i][6]=1
    elif train_tag[i]==7:
        train_tag_bool[i][7]=1
    elif train_tag[i]==8:
        train_tag_bool[i][8]=1
    elif train_tag[i]==9:
        train_tag_bool[i][9]=1

Brute force, I know, but that's the least of my concerns right now. The result is a 3000 x 10 matrix with 1's marking the digit for each sample. The first element represents digit 0, the last represents 9.

ex. [0 0 0 0 0 0 1 0 0 0] represents 6, [1 0 0 0 0 0 0 0 0 0] represents 0.
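For reference, the same encoding can be built without the if/elif chain. This is a minimal sketch; the helper name one_hot is made up for illustration, and it assumes the labels are integers in the range 0 to 9.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float64 matrix with a 1 in each row's label column."""
    labels = np.asarray(labels).reshape(-1)  # also accepts an (n, 1) column vector
    encoded = np.zeros((labels.size, num_classes), dtype=np.float64)
    # Fancy indexing: row i gets a 1 in column labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([6, 0, 9])
print(example)
```

With the variables from the question, this would be called as one_hot(train_tag); the reshape handles the (3000, 1) shape that train_tag has after the reshape in the original code.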

The original code:

import cPickle, gzip
import numpy as np

#from deeplearning.net
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()





#sigmoid function
def nonlin(x, deriv=False):
    if (deriv ==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))

#seed random numbers to make calculation
#deterministic (just a good practice)

np.random.seed(1)




#need to decrease the sample size or else computer dies
train_sample=train_set[0][0:3000]
train_tag=train_set[1][0:3000]
train_tag=train_tag.reshape(len(train_tag), 1)

#train_set's dimension for the pixels are 50000(samples) x 784 (28x28 for each sample)
#therefore the coefficients should be 784x50000 to make the hidden layer 50k x 50k

syn0 = 2*np.random.random((784,len(train_sample))) - 1
syn1 = 2*np.random.random((len(train_sample),1)) - 1


for i in xrange(10000):
    #forward propagation
    l0=train_sample
    l1=nonlin(np.dot(l0, syn0))
    l2=nonlin(np.dot(l1, syn1))

    #calculate error
    l2_error=train_tag-l2

    if (i% 1000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))
    #apply sigmoid to the error 
    l2_delta = l2_error*nonlin(l2,deriv=True)

    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1,deriv=True)
    #update weights

    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

Reference:

http://iamtrask.github.io/2015/07/12/basic-python-network/

http://yann.lecun.com/exdb/mnist/

IVlad · Accepted Answer

I can't currently run the code, but there are a few things that stand out. I'm surprised it works well even on the toy problems used on the blog.

Before we start, you'll need more output neurons: 10 to be exact.

syn1 = 2*np.random.random((len(train_sample), 10)) - 1

And your labels (y) had better be length-10 arrays with a 1 at the position of the correct digit and 0 elsewhere.

First of all, one thing I always attempt by default is to use float64 wherever possible... which almost never changes anything, so I'm not sure if you should get into this habit or not. Probably not.

Second, that code has no learning rate that you can set. This means that the learning rate is implicitly 1, which is huge for your problem, where people use 0.01 or even much less. To add a learning rate alpha, do:

syn1 += alpha * l1.T.dot(l2_delta)
syn0 += alpha * l0.T.dot(l1_delta)

And set it to at most 0.01. You'll have to fiddle with it for best results.

Third, it's usually better to initialize the net with small weights. [-1, 1), which is what your code uses, might be too big. Try:

syn0 = (np.random.random((784,len(train_sample))) - 0.5) / 4
syn1 = (np.random.random((len(train_sample),10)) - 0.5) / 4

There are more involved initialization schemes that you can search for if you're interested, but I've gotten decent results with the above.
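To get a feel for why the scale matters, the sketch below compares how spread out the hidden layer's pre-activations are for a few weight ranges. Everything in it is an illustrative assumption: random values in [0, 1) stand in for MNIST pixels, 200 hidden units is arbitrary, and the 1/sqrt(number of inputs) scale is one commonly used heuristic rather than something taken from this answer.

```python
import numpy as np

rng = np.random.RandomState(1)

n_inputs, n_hidden = 784, 200            # 200 hidden units is an arbitrary choice
x = rng.random_sample((100, n_inputs))   # stand-in for 100 images with pixels in [0, 1)

def preactivation_std(scale):
    """Standard deviation of x.dot(W) when W is uniform in [-scale, scale)."""
    w = (2 * rng.random_sample((n_inputs, n_hidden)) - 1) * scale
    return np.dot(x, w).std()

# Compare the original range, a reduced range, and a fan-in based heuristic.
for scale in (1.0, 0.125, 1.0 / np.sqrt(n_inputs)):
    print(scale, preactivation_std(scale))
```

The wider the spread, the more units land in the flat tails of the sigmoid, where the x*(1-x) derivative is close to zero and learning stalls. The effect grows with the number of terms in each sum, so a 3000-unit hidden layer feeding the output layer makes it worse.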

Fourth, regularization. The easiest to implement is probably weight decay. A weight decay coefficient (usually written lambda; named lambda_ here because lambda is a reserved word in Python) can be added like this:

syn1 += alpha * l1.T.dot(l2_delta) - alpha * lambda_ * syn1
syn0 += alpha * l0.T.dot(l1_delta) - alpha * lambda_ * syn0

Common values are also < 0.1 or even < 0.01.

Dropout can also help, but it's a bit harder to implement and understand if you're just starting out, in my opinion. It's also more useful for deeper nets AFAIK. So maybe leave this for last.

Fifth, maybe also use momentum (explained in the weight decay link), which should decrease the learning time for your network. Also tune the number of iterations: you don't want too many, but not too few either.
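As a rough illustration of how momentum and weight decay fit into a single update, here is a sketch using classical momentum on a toy quadratic. The function name sgd_step and the default values are made up for this example, and it follows the descent convention (grad is the gradient of the loss), whereas the question's code adds the delta because its error is defined as target minus output.

```python
import numpy as np

def sgd_step(w, grad, velocity, alpha=0.01, momentum=0.9, weight_decay=1e-4):
    """One descent step with classical momentum and L2 weight decay.

    `grad` is the gradient of the loss with respect to `w`.
    """
    # The velocity accumulates past steps; weight decay adds a pull towards zero.
    velocity = momentum * velocity - alpha * (grad + weight_decay * w)
    return w + velocity, velocity

# Toy problem: minimise f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
velocity = np.zeros_like(w)
for _ in range(200):
    w, velocity = sgd_step(w, grad=w, velocity=velocity)

print(np.abs(w).max())  # should be very close to zero
```

In a network, the same step would be applied to each weight matrix with its own velocity array of matching shape.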

Sixth, look into softmax for the output layer.
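A minimal sketch of a softmax that stays finite for large inputs is shown below. Subtracting the row maximum before exponentiating is a standard trick that does not change the result; the input values are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting each row's max keeps np.exp from overflowing."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # the second row would overflow a naive exp
probs = softmax(scores)
print(probs.sum(axis=1))  # each row sums to 1
```

When softmax is paired with a cross-entropy loss, the output-layer delta reduces to probs minus the one-hot targets, so no separate derivative term is needed for that layer.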

Seventh, look into tanh instead of your current nonlin sigmoid function.

If you apply these incrementally, you should start getting some meaningful results. I think regularization and smaller initial weights should help with the overflow errors.
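If the overflow warnings come from np.exp(-x) inside nonlin (an assumption, since the question doesn't show the warning text), a numerically stable sigmoid is another option. This is a sketch of a common formulation that differs from the code in the question; stable_sigmoid is a made-up name.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never calls np.exp on a large positive argument."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1, so 1 / (1 + exp(-x)) is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, rewrite as exp(x) / (1 + exp(x)); exp(x) is below 1 here.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

This only removes the warning from the activation itself. Saturated units still learn slowly, so the smaller weights and learning rate remain the main fix.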

Update:

I have changed the code like this. After only 100 training epochs, accuracy is 84.79%. Not too bad with barely tweaking anything.

I have added bias neurons, momentum and weight decay, used fewer hidden units (it was way too slow with what you had), switched to the tanh function, and made a few other changes.

You should be able to tweak it some more from here. I use Python 3.4, so I had to change a few things to get it to run, but it's nothing major.

import pickle, gzip
import numpy as np

#from deeplearning.net
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
f.close()





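#tanh activation (replaces the sigmoid)
#with deriv=True, x must already be the tanh output, so the derivative is 1 - x^2
def nonlin(x, deriv=False):
    if (deriv ==True):
        return 1-x*x
    return np.tanh(x)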

#seed random numbers to make calculation
#deterministic (just a good practice)

np.random.seed(1)

def make_proper_pairs_from_set(data_set):
    data_set_x, data_set_y = data_set

    data_set_y = np.eye(10)[:, data_set_y].T

    return data_set_x, data_set_y


train_x, train_y = make_proper_pairs_from_set(train_set)
train_x = train_x
train_y = train_y

test_x, test_y = make_proper_pairs_from_set(test_set)

print(len(train_y))

#train_set's pixel array is 50000 (samples) x 784 (28x28 for each sample)
#the weight matrices below have 785 and 201 rows: one extra row each for the bias neuron

# changed to 200 hidden neurons, should be plenty
syn0 = (2*np.random.random((785,200)) - 1) / 10
syn1 = (2*np.random.random((201,10)) - 1) / 10

velocities0 = np.zeros(syn0.shape)
velocities1 = np.zeros(syn1.shape)

alpha = 0.01
beta = 0.0001
momentum = 0.99

m = len(train_x) # number of training samples

# moved the forward propagation to a function and added bias neurons
def forward_prop(set_x, m):

    l0 = np.c_[np.ones((m, 1)), set_x]

    l1 = nonlin(np.dot(l0, syn0))
    l1 = np.c_[np.ones((m, 1)), l1]

    l2 = nonlin(np.dot(l1, syn1))


    return l0, l1, l2, l2.argmax(axis=1)

num_epochs = 100
for i in range(num_epochs):
    # forward propagation

    l0, l1, l2, _ = forward_prop(train_x, m)

    # calculate error
    l2_error = l2 - train_y


    print("Error " + str(i) + ": " + str(np.mean(np.abs(l2_error))))
    # scale the error by the derivative of the tanh activation
    l2_delta = l2_error * nonlin(l2,deriv=True)

    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1,deriv=True)
    l1_delta = l1_delta[:, 1:]

    # update weights
    # divide gradients by the number of samples
    grad0 = l0.T.dot(l1_delta) / m
    grad1 = l1.T.dot(l2_delta) / m

    v0 = velocities0
    v1 = velocities1

    velocities0 = velocities0 * momentum - alpha * grad0
    velocities1 = velocities1 * momentum - alpha * grad1


    # divide regularization by number of samples
    # because L2 regularization reduces to this
    syn1 += -v1 * momentum + (1 + momentum) * velocities1 - alpha * beta * syn1 / m
    syn0 += -v0 * momentum + (1 + momentum) * velocities0 - alpha * beta * syn0 / m



# find accuracy on test set

predictions = []
corrects = []
for i in range(len(test_x)): # you can eliminate this loop too with a bit of work, but this part is very fast anyway
    _, _, _, rez = forward_prop([test_x[i, :]], 1)

    predictions.append(rez[0])
    corrects.append(test_y[i].argmax())

predictions = np.array(predictions)
corrects = np.array(corrects)

print(np.sum(predictions == corrects) / len(test_x))
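The evaluation loop can also be replaced by a single forward pass over the whole test set. The sketch below is self-contained, so it uses random stand-in weights and data with the same shapes as above; the names predict and test_labels are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)

# Hypothetical stand-ins with the same shapes as in the code above.
syn0 = (2 * rng.random_sample((785, 200)) - 1) / 10
syn1 = (2 * rng.random_sample((201, 10)) - 1) / 10
test_x = rng.random_sample((50, 784))
test_labels = rng.randint(0, 10, size=50)

def predict(x):
    """Forward pass over all rows of x at once; returns the predicted digit per row."""
    m = len(x)
    l0 = np.c_[np.ones((m, 1)), x]                       # prepend the bias column
    l1 = np.c_[np.ones((m, 1)), np.tanh(l0.dot(syn0))]   # hidden layer plus bias
    l2 = np.tanh(l1.dot(syn1))
    return l2.argmax(axis=1)

predictions = predict(test_x)
accuracy = np.mean(predictions == test_labels)
print(accuracy)
```

With the code above, the equivalent would be to call forward_prop(test_x, len(test_x)) once and compare its last return value with test_y.argmax(axis=1).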

Update 2:

If you increase the learning rate to 0.05 and the epochs to 1000, you get 95.43% accuracy.

Seeding the random number generator with the current time, adding more hidden neurons (or hidden layers) and more parameter tweaks can get this simple model to about 98% accuracy AFAIK. The problem is that it's slow to train.

Also, this methodology isn't really sound. I optimized the parameters to increase the accuracy on the test set, so I might be overfitting the test set. You should use cross validation or the validation set.

Anyway, as you can see, there are no overflow errors. If you want to discuss things in more detail, feel free to drop me an e-mail (address in profile).
