Reputation: 679
I have my dataset in a TensorFlow Dataset pipeline and I am wondering how I can normalize it. The problem is that in order to normalize you need to load your entire dataset, which is exactly the opposite of what the TensorFlow Dataset pipeline is for.
So how exactly does one normalize a TensorFlow Dataset pipeline? And how do I apply it to new data (i.e. data used to make a new prediction)?
Upvotes: 3
Views: 2553
Reputation: 14993
You do not need to normalise the entire dataset at once.
Depending on the type of data you work with, you can use a .map() function whose sole purpose is to normalise the specific batch of data you are working with (for instance, divide each pixel within an image by 255.0).
You can use, for instance, map(preprocess_function_1).map(preprocess_function_2).batch(batch_size), where preprocess_function_1() and preprocess_function_2() are two different functions that preprocess a Tensor. If you use .batch(batch_size), the preprocessing functions are applied sequentially on batch_size elements at a time; you do not need to alter the entire dataset before using tf.data.Dataset().
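As a minimal sketch of that idea (assuming uint8 image tensors in [0, 255] with integer labels; the dummy arrays below just stand in for your real data):

import numpy as np
import tensorflow as tf

# Dummy data standing in for your real inputs.
images = np.random.randint(0, 256, size=(100, 28, 28, 1), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(100,), dtype=np.int64)

def normalize(image, label):
    # Per-element normalisation: no global statistics needed,
    # just rescale each pixel to [0.0, 1.0].
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(normalize)
    .batch(32)
)

Because the scaling factor is a constant, the transformation streams through the pipeline and never requires the whole dataset in memory.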
Upvotes: 2
Reputation: 36614
There is no other way than to iterate through the entire dataset once and collect the information you need. This is what they're doing in the TensorFlow documentation examples. For instance, here they are collecting all the words in order to tokenize the input:
import tensorflow_datasets as tfds

tokenizer = tfds.features.text.Tokenizer()

# One pass over the whole dataset to build the vocabulary.
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
For normalization you would need to iterate through all the data and keep track of the mean, max, etc.
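A rough sketch of that approach, assuming a dataset of float feature vectors (the dummy array below stands in for your real pipeline); it also shows how to reuse the collected statistics on new data at prediction time:

import numpy as np
import tensorflow as tf

# Dummy feature vectors standing in for your real dataset.
features = np.random.rand(1000, 8).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(features)

# One pass over the data to collect statistics, analogous to the vocabulary loop above.
count = 0
total = tf.zeros(8)
total_sq = tf.zeros(8)
for batch in dataset.batch(128):
    count += int(tf.shape(batch)[0])
    total += tf.reduce_sum(batch, axis=0)
    total_sq += tf.reduce_sum(tf.square(batch), axis=0)

mean = total / count
std = tf.sqrt(total_sq / count - tf.square(mean))

def standardize(x):
    # The same mean/std are applied to training data and to any new data later.
    return (x - mean) / std

train_ds = dataset.map(standardize).batch(32)

# New data for a prediction goes through the identical transformation.
new_sample = np.random.rand(1, 8).astype(np.float32)
new_ds = tf.data.Dataset.from_tensor_slices(new_sample).map(standardize).batch(1)

Only the summary statistics are held in memory after the initial pass; the pipeline itself still streams the data batch by batch.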
Upvotes: 2