tdh

Reputation: 872

3-Layer Neural Network Getting Stuck in Local Minima

I've programmed a 3-layer neural network in Python, based on this tutorial, to play Rock, Paper, Scissors, with sample data using -1 for rock, 0 for paper, and 1 for scissors, and arrays similar to those in the tutorial. My function seems to get stuck in a relative minimum on every run, and I'm looking for a way to remedy this. The program is below.

#math module
import numpy as np

#sigmoid function squashes numbers into the range (0, 1)
def nonlin(x, deriv = False):
    if (deriv == True): #sigmoid derivative is just
        return x*(1-x)  #output * (1 - output)

    return 1/(1+np.exp(-x)) #return the sigmoid of x

#input data: using MOCK RPS DATA, -1:ROCK, 0:PAPER, 1:SCISSORS
input_data = np.array([[1, 1, 1],
                    [0, 0, 0],
                    [-1, -1, -1],
                    [-1, 1, -1]])
#also for training
output_data = np.array([[1],
                    [0],
                    [-1],
                    [1]])

#seed the random number generator so runs are reproducible
np.random.seed(1)

#create random weights to be trained in loop
firstLayer_weights = 2*np.random.random((3, 4)) - 1 #size of matrix
secondLayer_weights = 2*np.random.random((4, 1)) - 1

for value in range(60000): # loops through training

    #pass input through weights to output: three layers
    layer0 = input_data
    #layer1 takes dot product of the input and weight matrices, then maps them to sigmoid function
    layer1 = nonlin(np.dot(layer0, firstLayer_weights))
    #layer2 takes dot product of layer1 result and weight matrices, then maps them to sigmoid function
    layer2 = nonlin(np.dot(layer1, secondLayer_weights))

    #check computer predicted result against actual data
    layer2_error = output_data - layer2

    #if value is a multiple of 10,000 (so six times out of 60,000),
    #print how far off the predicted value was from the data
    if value % 10000 == 0:
        print("Error:" + str(np.mean(np.abs(layer2_error)))) #average error

    #find out how much to re-adjust weights based on how far off and how confident the estimate
    layer2_change = layer2_error * nonlin(layer2, deriv = True)

    #find out how layer1 led to error in layer 2, to attack root of problem
    layer1_error = layer2_change.dot(secondLayer_weights.T)
    #^^propagates layer2's error backwards through the weights to find layer1's share of it: BACKPROPAGATION

    #same thing as layer2 change, change based on accuracy and confidence
    layer1_change = layer1_error * nonlin(layer1, deriv=True)

    #modify weights based on multiplication of error between two layers
    secondLayer_weights = secondLayer_weights + layer1.T.dot(layer2_change)
    firstLayer_weights = firstLayer_weights + layer0.T.dot(layer1_change)

As you can see, this section is the data involved:

input_data = np.array([[1, 1, 1],
                       [0, 0, 0],
                       [-1, -1, -1],
                       [-1, 1, -1]])
#also for training
output_data = np.array([[1],
                        [0],
                        [-1],
                        [1]])

And the weights are here:

firstLayer_weights = 2*np.random.random((3, 4)) - 1 #size of matrix
secondLayer_weights = 2*np.random.random((4, 1)) - 1

It seems that after the first thousand generations, the weights improve only minimally for the remainder of training, leading me to believe they've reached a relative minimum, as shown here:

[Plot: relative minimum point for the weights, with the error plateauing]

What is a quick and efficient way to rectify this issue?

Upvotes: 0

Views: 1789

Answers (2)

mrry

Reputation: 126184

One issue with your network is that the output (the value of the elements of layer2) can only vary between 0 and 1, because you're using a sigmoid nonlinearity. Since one of your four target values is -1 and the closest possible prediction is 0, there will always be at least 25% error. Here are a few suggestions:

  1. Use a one-hot encoding for the outputs: i.e. have three output nodes—one for each of ROCK, PAPER and SCISSORS—and train the network to compute a probability distribution across these outputs (typically using softmax and cross-entropy loss).

  2. Make the output layer of your network a linear layer (apply weights and biases, but not a nonlinearity). Either add another layer, or remove the nonlinearity from your current output layer.
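A minimal sketch of suggestion 1, assuming NumPy (the encoding order and the `softmax` helper are my own, not from the question):

```python
import numpy as np

# One-hot targets: a column each for ROCK, PAPER, SCISSORS
# (rows correspond to the four training examples in the question)
output_data = np.array([[0, 0, 1],   # 1  -> scissors
                        [0, 1, 0],   # 0  -> paper
                        [1, 0, 0],   # -1 -> rock
                        [0, 0, 1]])  # 1  -> scissors

def softmax(x):
    # subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# with three output nodes, the final layer produces one score per class
logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)   # each row sums to 1
```

The second weight matrix would then have shape (4, 3) instead of (4, 1), and the predicted class is the `argmax` of the output row.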

Other things you could try, but are less likely to work reliably, since really you are dealing with categorical data rather than a continuous output:

  1. Scale your data so that all of the outputs in the training data are between 0 and 1.

  2. Use a non-linearity that produces values between -1 and 1 (such as tanh).
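For option 2, the `nonlin` function from the question could be swapped for a tanh version along these lines (a sketch; as with the sigmoid, the derivative is expressed in terms of the layer's output):

```python
import numpy as np

def nonlin(x, deriv=False):
    # tanh maps inputs to (-1, 1), so the target value -1 is reachable;
    # in terms of the output, its derivative is 1 - output**2
    if deriv:
        return 1 - x**2
    return np.tanh(x)
```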

Upvotes: 6

www.data-blogger.com

Reputation: 4164

Add a little noise to the weights after each iteration. This can knock your program out of the local minimum and let it keep improving (if possible). There is a fair amount of literature on this; see, for example, http://paper.ijcsns.org/07_book/200705/20070513.pdf.
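A minimal sketch of this idea, assuming NumPy (the `noise_scale` value is hypothetical and would need tuning for your problem):

```python
import numpy as np

rng = np.random.default_rng(1)
noise_scale = 0.01  # hypothetical value; tune for your problem

# same shapes as the weight matrices in the question
firstLayer_weights = 2 * rng.random((3, 4)) - 1
secondLayer_weights = 2 * rng.random((4, 1)) - 1

# ...inside the training loop, after the usual weight update:
firstLayer_weights += rng.normal(0.0, noise_scale, firstLayer_weights.shape)
secondLayer_weights += rng.normal(0.0, noise_scale, secondLayer_weights.shape)
```

Small Gaussian perturbations leave the weight shapes unchanged while nudging the network off any plateau it has settled on.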

Upvotes: 0
