Reputation: 443
Often, to improve learning rates, inputs to a neural network are preprocessed by scaling and shifting to be between -1 and 1. I'm wondering though if that's a good idea with an input whose graph would be exponentially decaying. For instance, if I had an input with integer values 0 to 100 distributed with the majority of inputs being 0 and smaller values being more common than large values, with 99 being very rare.
Seems that scaling them and shifting wouldn't be ideal, since now the most common value would be -1. How is this type of input best dealt with?
Upvotes: 0
Views: 1495
Reputation: 11005
Consider you're using a sigmoid activation function which is symmetric around the origin:
The trick to speed up convergence is to have the mean of the normalized data set be 0 as well. The choice of activation function is important because you're not only learning weights from the input to the first hidden layer, i.e. normalizing the input is not enough: the input to the second hidden layer/output is learned as well and thus needs to obey the same rule to be consequential. In the case of non-input layers this is done by the activation function. The much cited Efficient Backprop paper by Lecun summarizes these rules and has some nice explanations as well which you should look up. Because there's other things like weight and bias initialization that one should consider as well.
In chapter 4.3 he gives a formula to normalize the inputs in a way to have the mean close to 0 and the std deviation 1. If you need more sources, this is great faq as well.
I don't know your application scenario but if you're using symbolic data and 0-100 is ment to represent percentages, then you could also apply softmax to the input layer to get better input representations. It's also worth noting that some people prefer scaling to [.1,.9] instead of [0,1]
Edit: Rewritten to match comments.
Upvotes: 1