tudor balus

Reputation: 149

Strange convergence in simple Neural Network

I've been struggling for some time with building a simplistic NN in Java. I've been working on and off on this project for a few months and I want to finish it. My main issue is that I don't know how to implement backpropagation correctly (all the sources I've found use Python, heavy math notation, or explain the idea too briefly). Today I tried deducing the rule by myself, and what I'm using is:

the weight update = error * sigmoidDerivative(error) * weight itself;
error = output - actual; (last layer)
error = sigmoidDerivative(error from previous layer) * weight attaching this neuron to the neuron giving the error (intermediary layer)

My main problem is that the outputs converge towards an average value, and my secondary problem is that the weights get updated to extremely weird values (the weights issue is probably causing the convergence).

What I'm trying to train: for inputs 1-9, the expected output is (x*1.2+1)/10. This is just a rule that came to me randomly. I'm using a NN with the structure 1-1-1 (3 layers, 1 neuron per layer). In the link below I attached two runs: one in which I'm using the training set that follows the rule (x*1.2+1)/10, and one in which I'm using (x*1.2+1)/100. With the division by 10, the first weight goes towards infinity; with the division by 100, the second weight tends towards 0. I kept trying to debug it, but I have no idea what I should be looking for or what's wrong. Any suggestions are much appreciated. Thank you in advance and a great day to you all!

https://wetransfer.com/downloads/55be9e3e10c56ab0d6b3f36ad990ebe120171210162746/1a7b6f

I have as training samples the inputs 1-9 and their respective outputs following the rule above, and I run them for 100_000 epochs. I log the error every 100 epochs, since that is easier to plot with fewer datapoints while still having 1000 datapoints for each of the 9 expected outputs. Code for backpropagation and weight updates:

    //for each layer in the Dweights array
    for(int k=deltaWeights.length-1; k >= 0; k--)
    {
        for(int i=0; i<deltaWeights[k][0].length; i++)     // for each neuron in the layer
        {
            if(k == network.length-2)      // if we're on the last layer, we calculate the errors directly
            {
                outputErrors[k][i] = outputs[i] - network[k+1][i].result;
                errors[i] = outputErrors[k][i];
            }
            else        // otherwise the error is actually the sum of errors feeding backwards into the neuron currently being processed * their respective weight
            {
                for(int j=0; j<outputErrors[k+1].length; j++)
                {                         // S'(error from previous layer) * weight attached to it
                    outputErrors[k][i] += sigmoidDerivative(outputErrors[k+1][j])[0] * network[k+1][i].emergingWeights[j];
                }
            }
        }

        for (int i=0; i<deltaWeights[k].length; i++)           // for each neuron
        {
            for(int j=0; j<deltaWeights[k][i].length; j++)     // for each weight attached to that respective neuron
            {                        // error                S'(error)                                  weight connected to respective neuron                
                deltaWeights[k][i][j] = outputErrors[k][j] * sigmoidDerivative(outputErrors[k][j])[0] * network[k][i].emergingWeights[j];
            }
        }
    }

    // we use the learning rate as an order of magnitude, to scale how drastic the changes in this iteration are
    for(int k=deltaWeights.length-1; k >= 0; k--)       // for each layer
    {
        for (int i=0; i<deltaWeights[k].length; i++)            // for each neuron
        {
            for(int j=0; j<deltaWeights[k][i].length; j++)     // for each weight attached to that respective neuron
            {
                deltaWeights[k][i][j] *=  1;       // previously was learningRate; MSEAvgSlope

                network[k][i].emergingWeights[j] += deltaWeights[k][i][j];
            }
        }
    }

    return errors;

Edit: a quick question that comes to mind: since I'm using sigmoid as my activation function, should my input and output values be restricted to 0-1? My outputs are between 0-1, but my inputs are literally 1-9.

Edit2: normalized the input values to be 0.1-0.9 and changed:

    outputErrors[k][i] += sigmoidDerivative(outputErrors[k+1][j])[0] * network[k+1][i].emergingWeights[j];     

to:

    outputErrors[k][i] = sigmoidDerivative(outputErrors[k+1][j])[0] * network[k+1][i].emergingWeights[j]* outputErrors[k+1][j];       

so that I keep the sign of the output error itself. This repaired the infinity tendency in the first weight. Now, with the /10 run, the first weight tends to 0, and with the /100 run, the second weight tends to 0. Still hoping that someone will chime in to clear things up for me. :(

Upvotes: 1

Views: 416

Answers (1)

Bastian

Reputation: 1593

I've seen several problems with your code; your weight updates, for example, are incorrect. I'd also strongly recommend organizing your code more cleanly by introducing methods.

Backpropagation is usually hard to implement efficiently, but the formal definitions are easily translated into any language. I would not recommend studying neural nets from code. Look at the math and try to understand that; it makes you far more flexible when implementing one from scratch.

I can give you some hints by describing the forward and backward pass in pseudocode.

As a matter of notation, I use i for the input layer, j for the hidden layer and k for the output layer. The bias of the input layer is then bias_i. The weight connecting node m to node n is w_mn. The activation function is a(x) and its derivative is a'(x).

Forward pass:

for each n of j
       dot = 0
       for each m of i
              dot += m*w_mn
       n = a(dot + bias_i)

The same applies for the output layer k and the hidden layer j. Hence, just replace j by k and i by j for this step.
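
Translated into Java, the forward pass above might look like this. This is a minimal sketch; the names forward and sigmoid, and the use of one bias per node, are illustrative assumptions, not code from the question:

```java
public class Main {
    // Logistic activation a(x) = 1/(1+e^(-x))
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // in[m]:   activations of the previous layer (the "for each m of i" loop)
    // w[m][n]: weight w_mn from previous-layer node m to this-layer node n
    // bias[n]: bias of this-layer node n
    static double[] forward(double[] in, double[][] w, double[] bias) {
        double[] out = new double[bias.length];
        for (int n = 0; n < out.length; n++) {
            double dot = 0.0;
            for (int m = 0; m < in.length; m++) {
                dot += in[m] * w[m][n];        // dot += m * w_mn
            }
            out[n] = sigmoid(dot + bias[n]);   // n = a(dot + bias)
        }
        return out;
    }

    public static void main(String[] args) {
        double[] hidden = forward(new double[]{0.1},
                                  new double[][]{{0.5}},
                                  new double[]{0.0});
        System.out.println(hidden[0]);         // sigmoid(0.05), roughly 0.5125
    }
}
```

Running the same method again with the hidden activations as input gives the output layer, exactly as the "replace j by k and i by j" note describes.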

Backward pass:

Calculate delta for output nodes:

for each n of k
       d_n = a'(n)(n - target)

Here, target is the expected output and n is the output of the current output node; d_n is the delta of this node. An important note: the derivatives of the logistic and tanh functions contain the output of the original function, so these values don't have to be reevaluated. The logistic function is f(x) = 1/(1+e^(-x)) and its derivative is f'(x) = f(x)(1-f(x)). Since the value at each output node n was previously evaluated with f(x), one can simply apply n(1-n) as the derivative. In the case above this would calculate the delta as follows:

d_n = n(1-n)(n - target)

In the same fashion, calculate the deltas for the hidden nodes.

for each n of j
      d_n = 0
      for each m of k
             d_n += d_m*w_nm
      d_n = a'(n)*d_n
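
Both delta computations can be sketched in Java like so. The helper names are hypothetical; wOut[n][m] is assumed to be the weight from hidden node n to output node m:

```java
public class Main {
    // Output-layer delta: d_n = n*(1-n)*(n - target),
    // reusing the node's own output for the logistic derivative.
    static double[] outputDeltas(double[] out, double[] target) {
        double[] d = new double[out.length];
        for (int n = 0; n < out.length; n++) {
            d[n] = out[n] * (1.0 - out[n]) * (out[n] - target[n]);
        }
        return d;
    }

    // Hidden-layer delta: sum the downstream deltas weighted by the
    // connecting weights, then multiply by the derivative a'(n) = n*(1-n).
    static double[] hiddenDeltas(double[] hidden, double[][] wOut, double[] dOut) {
        double[] d = new double[hidden.length];
        for (int n = 0; n < hidden.length; n++) {
            double sum = 0.0;
            for (int m = 0; m < dOut.length; m++) {
                sum += dOut[m] * wOut[n][m];   // d_n += d_m * w_nm
            }
            d[n] = hidden[n] * (1.0 - hidden[n]) * sum;
        }
        return d;
    }

    public static void main(String[] args) {
        // Output 0.8 with target 1.0: delta = 0.8*0.2*(-0.2) = -0.032
        double[] dOut = outputDeltas(new double[]{0.8}, new double[]{1.0});
        // Hidden output 0.5, weight 0.4: delta = 0.5*0.5*(-0.032*0.4) = -0.0032
        double[] dHid = hiddenDeltas(new double[]{0.5}, new double[][]{{0.4}}, dOut);
        System.out.println(dOut[0] + " " + dHid[0]);
    }
}
```

Note that the delta is computed from the node's activation, not from the error itself; this is exactly where the question's rule sigmoidDerivative(error) goes wrong.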

The next step is to perform the weight update using the deltas. This is done by an algorithm called gradient descent. Without going into much detail, it can be accomplished as follows:

for each n of j
      for each m of k
            w_nm -= learning_rate*n*d_m

Same applies for the layer above. Just replace j by i and k by j.

To update the biases, just sum up the deltas of the connected nodes, multiply that sum by the learning rate, and subtract the product from the specific bias.
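
A minimal Java sketch of the update step, assuming one bias per node (with the per-layer bias described above, you would instead sum the deltas of the connected nodes before scaling by the learning rate):

```java
public class Main {
    // w[n][m]:  weight w_nm from node n (layer j) to node m (layer k)
    // act[n]:   activation of node n
    // delta[m]: delta previously computed at downstream node m
    static void updateWeights(double[][] w, double[] act, double[] delta, double lr) {
        for (int n = 0; n < act.length; n++) {
            for (int m = 0; m < delta.length; m++) {
                w[n][m] -= lr * act[n] * delta[m];   // w_nm -= learning_rate*n*d_m
            }
        }
    }

    // Per-node bias update: subtract learning rate times the node's delta.
    static void updateBiases(double[] bias, double[] delta, double lr) {
        for (int m = 0; m < delta.length; m++) {
            bias[m] -= lr * delta[m];
        }
    }

    public static void main(String[] args) {
        double[][] w = {{0.5}};
        updateWeights(w, new double[]{1.0}, new double[]{0.2}, 0.1);
        System.out.println(w[0][0]);   // 0.5 - 0.1*1.0*0.2 = 0.48
    }
}
```

The sign matters here: subtracting the gradient moves the weights downhill on the error surface, whereas the question's code adds the deltas, which is one reason its weights diverge.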

Upvotes: 1
