Reputation: 679
I have my dataset in a TensorFlow Dataset pipeline and I am wondering how I can normalize it. The problem is that in order to normalize you need to load your entire dataset, which is exactly the opposite of what the TensorFlow Dataset pipeline is for.
So how exactly does one normalize a TensorFlow Dataset pipeline? And how do I apply it to new data (i.e. data used to make a new prediction)?
Upvotes: 3
Views: 2553
Reputation: 14993
You do not need to normalise the entire dataset at once.
Depending on the type of data you work with, you can use a .map() function whose sole purpose is to normalise the specific batch of data you are working with (for instance, divide each pixel within an image by 255.0).
You can use, for instance, map(preprocess_function_1).map(preprocess_function_2).batch(batch_size), where preprocess_function_1() and preprocess_function_2() are two different functions that preprocess a Tensor. If you use .batch(batch_size), the preprocessing functions are applied sequentially on batch_size elements at a time; you do not need to alter the entire dataset before using tf.data.Dataset().
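As a minimal sketch of that idea (assuming uint8 image tensors in [0, 255] with integer labels; the dummy arrays below just stand in for your real data):

import numpy as np
import tensorflow as tf

# Dummy data standing in for your real inputs.
images = np.random.randint(0, 256, size=(100, 28, 28, 1), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(100,), dtype=np.int64)

def normalize(image, label):
    # Per-element normalisation: no global statistics needed,
    # just rescale each pixel to [0.0, 1.0].
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(normalize)
    .batch(32)
)

Because the scaling factor is a constant, the transformation streams through the pipeline and never requires the whole dataset in memory.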
Upvotes: 2
Reputation: 36614
There is no other way than to iterate through the entire dataset once and collect the information you need. This is what they're doing in the TensorFlow documentation examples. For instance, here they are collecting all the words in order to tokenize the input:
import tensorflow_datasets as tfds

tokenizer = tfds.features.text.Tokenizer()

# One pass over the whole dataset to build the vocabulary.
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
For normalization you would need to iterate through all the data and keep track of the mean, max, etc.
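A rough sketch of that approach, assuming a dataset of float feature vectors (the dummy array below stands in for your real pipeline); it also shows how to reuse the collected statistics on new data at prediction time:

import numpy as np
import tensorflow as tf

# Dummy feature vectors standing in for your real dataset.
features = np.random.rand(1000, 8).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(features)

# One pass over the data to collect statistics, analogous to the vocabulary loop above.
count = 0
total = tf.zeros(8)
total_sq = tf.zeros(8)
for batch in dataset.batch(128):
    count += int(tf.shape(batch)[0])
    total += tf.reduce_sum(batch, axis=0)
    total_sq += tf.reduce_sum(tf.square(batch), axis=0)

mean = total / count
std = tf.sqrt(total_sq / count - tf.square(mean))

def standardize(x):
    # The same mean/std are applied to training data and to any new data later.
    return (x - mean) / std

train_ds = dataset.map(standardize).batch(32)

# New data for a prediction goes through the identical transformation.
new_sample = np.random.rand(1, 8).astype(np.float32)
new_ds = tf.data.Dataset.from_tensor_slices(new_sample).map(standardize).batch(1)

Only the summary statistics are held in memory after the initial pass; the pipeline itself still streams the data batch by batch.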
Upvotes: 2