Reputation: 193
I am writing my master's thesis on how to apply LSTM neural networks to time series. In my experiments, I found that scaling the data can have a great impact on the result. For example, when I use a tanh activation function and scale the values to the range between -1 and 1, the model seems to converge faster, and the validation error does not jump around dramatically after each epoch.
Does anyone know of a mathematical explanation for this? Or are there any papers that already explain this situation?
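For context, the scaling I am referring to is a simple affine map onto [-1, 1]. A minimal sketch with placeholder arrays (not my actual pipeline):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy series as stand-ins for the real data.
train = np.random.randn(1000, 1)
valid = np.random.randn(200, 1)

# Map the training range onto [-1, 1] (the output range of tanh),
# then apply the same affine transform to the validation split.
scaler = MinMaxScaler(feature_range=(-1, 1))
train_scaled = scaler.fit_transform(train)
valid_scaled = scaler.transform(valid)
```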
Upvotes: 9
Views: 10083
Reputation: 7130
Your question reminds me of a picture used in our class; you can find a similar one here at 3:02.
In that picture you can clearly see that the path on the left is much longer than the one on the right. Scaling is what turns the left situation into the right one.
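To make the idea behind that picture concrete, here is a tiny toy experiment with a made-up quadratic loss (the curvatures and step sizes are chosen only to illustrate the effect): with unscaled features the curvature differs a lot between directions, the usable step size is capped by the steepest one, and gradient descent takes the long zig-zag path; after scaling the curvatures match and the same number of steps gets much closer to the minimum.

```python
import numpy as np

def descend(hessian, lr, steps=50):
    """Plain gradient descent on the quadratic f(w) = 0.5 * w^T H w."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * (hessian @ w)   # gradient of the quadratic is H w
    return np.linalg.norm(w)         # remaining distance to the optimum at 0

# Unscaled features: curvatures 1 and 100. The step size is capped by the
# steep direction, so the flat direction creeps along -> still far away.
print(descend(np.diag([1.0, 100.0]), lr=0.015))

# Scaled features: equal curvatures. A much larger step is stable,
# and the path goes almost straight to the minimum.
print(descend(np.diag([1.0, 1.0]), lr=0.9))
```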
Upvotes: 11
Reputation: 9422
Maybe the point is nonlinearity. My approach comes from chaos theory (fractals, multifractals, ...), where the range of the input and parameter values of a nonlinear dynamical system has a strong influence on the system's behavior. This is because of the nonlinearity: in the case of tanh,
the nonlinearity in the interval [-1, +1] is different from that in other intervals, e.g. on the range [10, infinity) it is approximately a constant.
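A quick numerical check of that claim (plain numpy, values rounded):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 3.0, 10.0])
print(np.tanh(x))            # ~[0, 0.46, 0.76, 0.995, 1.0]: nearly flat beyond a few units
print(1 - np.tanh(x) ** 2)   # derivative ~[1, 0.79, 0.42, 0.01, 0.0]: almost no gradient far from 0
```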
Any nonlinear dynamical system is only valid within a specific range of both parameters and initial values; see, e.g., the logistic map. Depending on the parameter values and initial values, the behavior of the logistic map is completely different; this is the sensitivity to initial conditions. RNNs can be regarded as nonlinear self-referential systems.
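A minimal sketch of that sensitivity, with arbitrarily chosen parameter values:

```python
def logistic(r, x0, steps=200):
    """Iterate the logistic map x_{n+1} = r * x_n * (1 - x_n)."""
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

print(logistic(2.8, 0.2), logistic(2.8, 0.2001))  # stable regime: both land near 0.643
print(logistic(3.9, 0.2), logistic(3.9, 0.2001))  # chaotic regime: tiny change, very different result
```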
In general there are some remarkable similarities between nonlinear dynamical systems and neural networks, e.g. the fading-memory property of Volterra series models in nonlinear system identification and the vanishing gradient in recurrent neural networks.
Strongly chaotic systems have the sensitivity-to-initial-conditions property, and it is not possible to reproduce this heavily nonlinear behavior with either Volterra series or RNNs, because of the fading memory and the vanishing gradient, respectively.
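A rough numerical sketch of that vanishing effect, using a random recurrent matrix and random pre-activations rather than a trained network, so the numbers are only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# The factor backpropagated through a tanh RNN is a product of terms of the
# form diag(1 - tanh(h)^2) @ W_rec; here we just track its norm over time.
n = 20
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
grad = np.eye(n)
for _ in range(100):                      # 100 time steps back
    h = rng.normal(size=n)                # stand-in pre-activations
    grad = np.diag(1 - np.tanh(h) ** 2) @ W @ grad
print(np.linalg.norm(grad))               # typically shrinks towards 0
```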
So the mathematical background could be that a nonlinearity is more 'active' within a specific interval, while a linear (or approximately constant) function behaves the same everywhere.
In the context of RNNs and in the context of monofractality/multifractality, 'scaling' has two different meanings. This is especially confusing because RNNs and nonlinear, self-referential systems are deeply linked.
In the context of RNNs, scaling means limiting the range of the input or output values in the sense of an affine transformation.
In the context of monofractality/multifractality, scaling means that the output of the nonlinear system has a specific structure that is scale-invariant in the case of monofractals, or self-affine in the case of self-affine fractals, where the scale is equivalent to a 'zoom level'.
The link between RNNs and nonlinear self-referential systems is that they are both exactly that: nonlinear and self-referential.
In general, sensitivity to initial conditions (which is related to the sensitivity to scaling in RNNs) and scale invariance of the resulting structures (the output) appear only in nonlinear self-referential systems.
The following paper is a good summary of multifractal and monofractal scaling in the output of a nonlinear self-referential system (not to be confused with the scaling of the input and output of RNNs): http://www.physics.mcgill.ca/~gang/eprints/eprintLovejoy/neweprint/Aegean.final.pdf
This paper draws a direct link between nonlinear systems and RNNs: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4107715/ - Nonlinear System Modeling with Random Matrices: Echo State Networks Revisited
Upvotes: 1