Magic-Ravioli

Reputation: 23

Non-converging Neural Network in C

I wrote my first feed-forward neural network in C, using the sigmoid 1.0 / (1.0 + exp(-x)) as the activation function and gradient descent to adjust the weights. I tried to approximate sin(x) to make sure my network works. However, the output neuron always oscillates between the extreme values 0 and 1, and the weights grow to absurd sizes, no matter how many hidden layers there are, how many neurons are in the hidden layer(s), how many training samples I provide, or even what the target outputs are.
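For concreteness, a minimal sketch of the sigmoid and the derivative it is paired with in backpropagation (illustrative names, not necessarily the actual code):

    #include <math.h>

    /* Logistic sigmoid: squashes any real input into (0, 1). */
    static double sigmoid(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    /* Derivative expressed through the activation itself:
     * sigma'(x) = sigma(x) * (1 - sigma(x)).
     * Note the argument is the output of sigmoid, not x itself. */
    static double sigmoid_prime(double activation)
    {
        return activation * (1.0 - activation);
    }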

1) Are there any standard 'tried and tested' data sets used to proof-test neural networks for errors? If so, what network structures work best (e.g. number of neurons in the hidden layer) for converging to the desired output?

2) Are there any common mistakes that produce these symptoms? I found this thread, but the issue there was caused by faulty data, which I believe is not the case here.

3) Is there a preferred way of training the network? In my implementation I cycle through the training samples, adjusting the weights after each one, then rinse and repeat ~1000 times (a sketch of this loop follows below). Is there another order that works better?
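In outline, the loop described above is roughly the following (illustrative names; train_sample stands in for one forward pass plus weight update):

    /* One pass over all samples = one epoch; repeated ~1000 times. */
    for (int epoch = 0; epoch < 1000; ++epoch) {
        for (int i = 0; i < n_samples; ++i) {
            /* forward pass + gradient step for a single sample */
            train_sample(&net, inputs[i], targets[i]);
        }
    }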

Upvotes: 2

Views: 197

Answers (1)

Marcin Możejko

Reputation: 40516

So, to sum up:

  1. Assuming that your gradient propagation works properly, the values of parameters such as the topology, the learning rate, the batch size, or the constant controlling the weight penalty (L1 and L2 decay) are usually found with techniques called grid search or random search. It has been shown empirically that random search performs better at this task (a minimal sketch follows after this list).

  2. The most common cause of diverging weights is a wrong learning rate. A value that is too big can make learning really hard, but on the other hand, when the learning rate is too small, the learning process can take a really long time. Usually you should babysit the learning phase. Detailed instructions can be found e.g. here.

  3. In your learning phase you used a technique called SGD (updating the weights after every single sample). It can usually achieve good results, but it is vulnerable to the variance of the data set and to big learning rates. What I advise is to use mini-batch learning and to treat the batch size as an additional parameter learnt during grid or random search (see the mini-batch sketch after this list). You can read about it e.g. here.

  4. Another thing you might consider is changing your activation function to tanh or ReLU. The sigmoid has a lot of problems with its saturation regions and usually needs careful initialization (see the activation sketch below). You can read about it here.
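To make point 1 concrete, here is a minimal sketch of random search over a learning rate and a hidden-layer size; train_and_evaluate is a placeholder for the actual training code, with a dummy body so the sketch compiles:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder: in real use this trains a network with the given
     * hyperparameters and returns the validation error. The dummy
     * body exists only so the sketch runs. */
    static double train_and_evaluate(double learning_rate, int hidden_neurons)
    {
        (void)hidden_neurons;
        return learning_rate; /* dummy value, not a real error measure */
    }

    int main(void)
    {
        srand(42); /* fixed seed: reproducible trials */

        double best_error = 1e9, best_lr = 0.0;
        int best_hidden = 0;

        /* Random search: sample each trial independently instead of
         * walking a fixed grid. */
        for (int trial = 0; trial < 50; ++trial) {
            /* learning rate drawn log-uniformly from [1e-4, 1] */
            double lr = pow(10.0, -4.0 * rand() / (double)RAND_MAX);
            /* hidden-layer size drawn uniformly from 2..32 */
            int hidden = 2 + rand() % 31;

            double err = train_and_evaluate(lr, hidden);
            if (err < best_error) {
                best_error = err;
                best_lr = lr;
                best_hidden = hidden;
            }
        }

        printf("best: lr=%g, hidden=%d, error=%g\n",
               best_lr, best_hidden, best_error);
        return 0;
    }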
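For point 3, the mini-batch idea in a self-contained toy: gradients are accumulated over a small batch and applied as one averaged step, which smooths out the sample-to-sample variance of plain SGD. A single linear neuron fitting y = 2x + 1 stands in for the network, since the update pattern is the point:

    #include <stdio.h>

    int main(void)
    {
        double xs[] = {0, 1, 2, 3, 4, 5, 6, 7};
        double ys[] = {1, 3, 5, 7, 9, 11, 13, 15}; /* y = 2x + 1 */
        double w = 0.0, b = 0.0, lr = 0.01;
        int n = 8, batch = 4;

        for (int epoch = 0; epoch < 2000; ++epoch) {
            for (int start = 0; start < n; start += batch) {
                int end = (start + batch < n) ? start + batch : n;
                double gw = 0.0, gb = 0.0;

                /* Accumulate gradients of the squared error over
                 * the whole batch before touching the weights. */
                for (int i = start; i < end; ++i) {
                    double err = (w * xs[i] + b) - ys[i];
                    gw += err * xs[i];
                    gb += err;
                }

                /* One averaged update per batch. */
                w -= lr * gw / (end - start);
                b -= lr * gb / (end - start);
            }
        }

        printf("w = %f, b = %f (expected 2 and 1)\n", w, b);
        return 0;
    }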
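And for point 4, the alternative activations with their derivatives (illustrative sketches, not drop-ins for any particular code base):

    #include <math.h>

    /* tanh comes from math.h; its output is zero-centered in (-1, 1),
     * and its derivative, expressed through the activation y = tanh(x),
     * is 1 - y*y. */
    static double tanh_prime(double y)
    {
        return 1.0 - y * y;
    }

    /* ReLU: identity for positive input, zero otherwise. Its gradient
     * is exactly 1 for positive activations, so it does not shrink the
     * error signal the way a saturated sigmoid does. */
    static double relu(double x)
    {
        return x > 0.0 ? x : 0.0;
    }

    static double relu_prime(double x)
    {
        return x > 0.0 ? 1.0 : 0.0;
    }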

Upvotes: 3
