Anas EL KORCHI

Reputation: 2048

How to test if my implementation of a backpropagation neural network is correct

I am working on an implementation of the backpropagation algorithm. What I have implemented so far seems to work, but I can't be sure that the algorithm is well implemented. Here is what I have noticed during the training tests of my network:

Specification of the implementation:

When I run the backpropagation training process:

Upvotes: 4

Views: 1862

Answers (1)

Lukasz Tracewski

Reputation: 11387

The short answer would be "no, very likely your implementation is incorrect". Your network is not training, as can be seen from the very high error cost. As discussed in the comments, your network suffers heavily from the vanishing gradient problem, which is inevitable in deep networks: in essence, the first layers of your network learn much more slowly than the later ones. All neurons get some random weights at the beginning, right? Since the first layer learns almost nothing, the large initial error propagates through the whole network!
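
You can observe this directly by printing the mean gradient magnitude per layer during the backward pass. Here is a minimal sketch (the random data and layer sizes are my assumptions, not your setup) of a deep all-sigmoid network; the magnitudes shrink as you move towards the input layer:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    sizes = [10, 10, 10, 10, 10, 1]                 # five weight matrices = "deep"
    weights = [rng.normal(0, 1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

    x = rng.normal(0, 1, (32, sizes[0]))            # random batch, just for the demo
    y = rng.normal(0, 1, (32, sizes[-1]))

    # Forward pass, keeping every activation for backprop.
    a = [x]
    for W in weights:
        a.append(sigmoid(a[-1] @ W))

    # Backward pass for squared error; watch the magnitudes shrink.
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])
    for i in reversed(range(len(weights))):
        grad = a[i].T @ delta
        print(f"layer {i}: mean |grad| = {np.abs(grad).mean():.2e}")
        if i > 0:
            delta = (delta @ weights[i].T) * a[i] * (1 - a[i])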

How to fix it? From the description of your problem it seems that a feedforward network with just a single hidden layer should be able to do the trick (as proven by the universal approximation theorem).
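
To make that concrete, here is a minimal sketch of such a single-hidden-layer network trained with plain backpropagation on XOR. The hidden-layer size, learning rate and iteration count are illustrative choices only:

    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # XOR: the classic "needs one hidden layer" problem.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros((1, 4))  # input -> hidden
    W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros((1, 1))  # hidden -> output
    lr = 1.0

    for _ in range(5000):
        # Forward pass.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Backward pass for squared-error loss.
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)

        # Gradient-descent updates.
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0, keepdims=True)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0, keepdims=True)

    print(out.round(2))  # with most seeds, close to [[0], [1], [1], [0]]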

Check e.g. the free online book by Michael Nielsen if you'd like to learn more.

So do I understand correctly that backpropagation can't deal with deep neural networks? Or is there some method to prevent this problem?

It can, but it's by no means a trivial challenge. Deep neural networks have been around since the '60s, but only in the '90s did researchers come up with methods to deal with them efficiently. I recommend reading the "Efficient BackProp" chapter (by Y. A. LeCun et al.) of "Neural Networks: Tricks of the Trade".

Here is the summary (a code sketch applying these tricks follows the list):

  • Shuffle the examples
  • Center the input variables by subtracting the mean
  • Normalize the input variable to a standard deviation of 1
  • If possible, decorrelate the input variables.
  • Pick a network with the sigmoid function f(x) = 1.7159 * tanh((2/3) * x): it won't saturate at +1 / -1, but instead will have its highest gain at these points (the second derivative is at its maximum there)
  • Set the target values within the range of the sigmoid, typically +1 and -1.
  • The weights should be randomly drawn from a distribution with mean zero and a standard deviation given by m^(-1/2), where m is the number of inputs to the unit
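
Here is a short sketch of what tricks 1-3 and 5-7 could look like in code; the data, names and layer sizes are mine, not from the paper:

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(5.0, 3.0, (200, 8))  # stand-in training data, rows = examples

    X = X[rng.permutation(len(X))]      # 1. shuffle the examples
    X = X - X.mean(axis=0)              # 2. center each input variable
    X = X / X.std(axis=0)               # 3. normalize each to std 1
    # 4. decorrelation would need e.g. PCA/whitening; skipped here

    def f(x):                           # 5. the recommended sigmoid: f(+-1) = +-1
        return 1.7159 * np.tanh((2.0 / 3.0) * x)

    # 6. targets would then be set to +1 / -1, inside the sigmoid's range

    m = X.shape[1]                      # 7. fan-in of each first-layer unit
    W = rng.normal(0.0, m ** -0.5, (m, 16))  # 16 hidden units, an arbitrary choice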

The preferred method for training the network should be picked as follows (a stochastic-gradient sketch follows the list):

  • If the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient descent with careful tuning, or use the stochastic diagonal Levenberg-Marquardt method.
  • If the training set is not too large, or if the task is regression, use conjugate gradient.
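
For the stochastic-gradient case, the outer loop could look like this sketch; train_step is a hypothetical helper standing in for one forward/backward pass plus weight update on a single example:

    import numpy as np

    def sgd(train_step, X, y, epochs=100, seed=0):
        """Plain stochastic gradient: one update per example."""
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in rng.permutation(len(X)):    # reshuffle every epoch
                train_step(X[i:i + 1], y[i:i + 1])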

Also, some of my general remarks:

  • Watch for numerical stability if you implement it yourself; it's easy to get into trouble (see the sigmoid sketch after this list).
  • Think about the architecture. Fully-connected multi-layer networks are rarely a smart idea. Unfortunately, ANNs are poorly understood from a theoretical point of view, and one of the best things you can do is just check what worked for others and learn useful patterns (with regularization, pooling and dropout layers and such).
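
On the numerical-stability point, one classic pitfall is worth showing: the naive sigmoid overflows np.exp for large negative inputs. A sketch of the usual branch-on-sign fix:

    import numpy as np

    def stable_sigmoid(x):
        # Naive 1 / (1 + exp(-x)) overflows np.exp for very negative x;
        # branching on the sign keeps every exp() argument non-positive.
        out = np.empty_like(x, dtype=float)
        pos = x >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
        ex = np.exp(x[~pos])
        out[~pos] = ex / (1.0 + ex)
        return out

    print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]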

Upvotes: 3
