user3306721

Reputation: 55

Tensorflow / Keras: Normalize train / test / real-time data, or how to handle reality?

I started developing some LSTM models and now have some questions about normalization.

Let's pretend I have some time series data that roughly ranges between +500 and -500. Would it be more realistic to scale the data from -1 to 1, or is 0 to 1 a better way? I tested it, and 0 to 1 seemed to train faster. Is there a wrong way to do it, or would one choice just be slower to learn?

Second question: when do I normalize the data? I split the data into training and test data; do I have to scale / normalize them separately? Maybe the training data only ranges between +300 and -200 while the test data ranges from +600 to -100. That's not very good, I guess.

But on the other hand... if I scale / normalize the entire dataframe and split it afterwards, the data is fine for training and testing, but how do I handle real new incoming data? The model was trained on scaled data, so I have to scale the new data as well, right? But what if a new value is 1000? The normalization would turn this into something greater than 1, because it's a bigger number than everything seen before.

To make a long story short: when do I normalize data, and what happens to completely new data?

I hope I made it clear what my problem is :D

Thank you very much!

Upvotes: 2

Views: 505

Answers (1)

Szymon Maszke

Reputation: 24701

Would like to know how to handle reality as well tbh...

On a serious note though:

1. How to normalize data

Usually, neural networks benefit from data coming from a standard Gaussian distribution (mean 0 and variance 1).

Techniques like Batch Normalization (simplifying a bit) help the network keep this trait throughout all of its layers, which is usually beneficial.
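If you want to try it, here is a minimal Keras sketch; the layer sizes and the input shape are made-up examples, not something from your question:

```python
import tensorflow as tf

# Hypothetical shapes: sequences of 30 timesteps with 1 feature each.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(30, 1)),
    tf.keras.layers.BatchNormalization(),  # re-normalizes activations between layers
    tf.keras.layers.LSTM(32),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
```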

There are other approaches like the ones you mentioned; to tell reliably what helps for a given problem and architecture, you just have to test and measure.
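For illustration, here is how standardization and the two min-max schemes you mention look with scikit-learn; the library choice and the toy data are just an assumption on my side:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.random.uniform(-500, 500, size=(1000, 1))  # toy data in your rough range

standard = StandardScaler().fit_transform(data)                        # mean 0, variance 1
zero_one = MinMaxScaler(feature_range=(0, 1)).fit_transform(data)      # scaled to [0, 1]
minus_one_one = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)  # scaled to [-1, 1]
```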

2. What about test data?

The mean you subtract and the variance you divide each instance by (or any other statistics gathered by whichever normalization scheme you choose) should be computed from your training dataset only. If you take them from the test set, you introduce data leakage (information about the test distribution is incorporated into training) and may get the false impression that your algorithm performs better than it really does.

So just compute the statistics over the training dataset and apply those same statistics to incoming/validation/test data as well.
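A minimal sketch of that workflow (the split point and array names are made up), including what happens to a completely new value like your 1000:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

series = np.random.uniform(-500, 500, size=(1000, 1))  # toy series
train, test = series[:800], series[800:]               # hypothetical 80/20 split

scaler = StandardScaler().fit(train)    # statistics come from training data ONLY
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # same statistics, no leakage

# Real-time data is transformed with the SAME fitted scaler; an unseen value
# like 1000 simply maps outside the range seen during training.
new_point = np.array([[1000.0]])
print(scaler.transform(new_point))
```

With standardization that out-of-range value is usually fine (it just gets a large z-score); with a min-max scaler it will land above 1, exactly as you suspected, which is one reason to prefer statistics-based scaling when new data can exceed historical bounds.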

Upvotes: 2
