Yuriy Zaletskyy

Reputation: 5151

How to prevent NN from forgetting old data

I've implemented a NN for OCR. My program had a fairly good percentage of successful recognitions, but recently (about two months ago) its performance decreased by ~23%. After analyzing the data, I noticed that some new irregularities had appeared in the images (additional twisting, noise). In other words, my NN needed to learn some new data, but I also needed to make sure it would not forget the old data. To achieve this, I trained the NN on a mixture of old and new data, and one tricky feature I tried was preventing the weights from changing too much (initially I limited changes to no more than 3%, but later allowed 15%). What else can be done to help the NN not "forget" old data?

Upvotes: 0

Views: 602

Answers (1)

Andnp

Reputation: 674

This question is the subject of current, active research.

It sounds to me as if your original implementation had over-learned (overfit) its original dataset, making it unable to generalize effectively to new data. There are many techniques available to prevent this from happening:

  1. Make sure that your network is the smallest size that can still solve the problem.
  2. Use some form of regularization technique. One of my favorites (and the current favorite of researchers) is the dropout technique. Basically, every time you feed forward, every neuron has a percent chance of returning 0 instead of its typical activation. Other common techniques include L1 and L2 regularization and weight decay.
  3. Play with your learning constant. Perhaps your constant is too high.
  4. Finally, continue training in the way you described: create a buffer of all data points (new and old) and train on randomly chosen points in a random order. This helps ensure your network does not fall into a local minimum.
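The dropout idea from point 2 can be sketched as a single fully connected layer in plain Python. This is a minimal illustration, not your implementation: the layer sizes, weights, and drop probability of 0.5 are all arbitrary choices, and `forward_with_dropout` is a hypothetical name. It uses "inverted" dropout, rescaling surviving activations so their expected value is unchanged between training and inference:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_with_dropout(inputs, weights, biases, p_drop=0.5, training=True):
    """One fully connected sigmoid layer with inverted dropout.

    `weights` is a list of per-neuron weight lists. During training each
    neuron independently returns 0 with probability `p_drop`; survivors are
    rescaled by 1 / (1 - p_drop). At inference time dropout is disabled.
    """
    activations = []
    for w_row, b in zip(weights, biases):
        a = sigmoid(sum(wi * xi for wi, xi in zip(w_row, inputs)) + b)
        if training and random.random() < p_drop:
            a = 0.0                      # neuron "dropped" on this forward pass
        elif training:
            a /= (1.0 - p_drop)          # keep expected activation unchanged
        activations.append(a)
    return activations

# Hypothetical 3-input, 5-neuron layer purely for demonstration.
inputs = [0.2, -0.4, 0.7]
weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
biases = [0.0] * 5

print(forward_with_dropout(inputs, weights, biases))                   # training pass
print(forward_with_dropout(inputs, weights, biases, training=False))   # inference pass
```

On a training pass some activations come back as 0; on an inference pass all neurons fire normally.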
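Point 4, the mixed buffer of old and new samples, amounts to pooling both datasets and reshuffling every epoch. A minimal sketch (the datasets and the `train_step` placeholder are invented for illustration; substitute your own samples and update rule):

```python
import random

random.seed(0)

# Hypothetical datasets: each sample is a (features, label) pair.
old_data = [([i * 0.1], "old") for i in range(8)]
new_data = [([i * 0.1 + 100.0], "new") for i in range(4)]

# Pool old and new samples into one buffer and reshuffle it every epoch,
# so the network keeps seeing the original distribution alongside the new one
# instead of training on the new data last (and overwriting old knowledge).
buffer = old_data + new_data
for epoch in range(3):
    random.shuffle(buffer)
    for features, label in buffer:
        pass  # train_step(features, label) -- whatever update rule your NN uses
```

The key detail is the per-epoch shuffle: presenting all old samples followed by all new ones is exactly the ordering that encourages forgetting.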

Personally, I would try these techniques before limiting how much a neuron can learn on each iteration. If you are using sigmoid or tanh activations, then values around .5 (sigmoid) or 0 (tanh) have a large derivative and will change rapidly, which is one of the advantages of these activations. To achieve a similar but less obtrusive effect, play with your learning constant. I'm not sure of the size of your net or the number of samples you have, but try a learning constant of ~.01.
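The claim about sigmoid derivatives is easy to check numerically: since sigmoid'(z) = sigmoid(z)·(1 − sigmoid(z)), the derivative peaks at 0.25 when the activation is exactly 0.5 and shrinks toward the saturated tails:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), maximized where sigmoid(z) = 0.5
    s = sigmoid(z)
    return s * (1.0 - s)

# Weights feeding activations near 0.5 get the largest gradient updates,
# so a smaller learning constant damps their changes without a hard clamp.
for z in [0.0, 2.0, 5.0]:
    print(f"z={z:>4}: sigmoid={sigmoid(z):.3f}, derivative={sigmoid_deriv(z):.4f}")
```

This is why shrinking the learning constant achieves the "don't move the weights too far" effect more gracefully than clamping weight changes to a fixed percentage.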

Upvotes: 1

Related Questions