Reputation: 33385
I'm running some experiments with neural networks in TensorFlow. The release notes for the latest version say DataSet is henceforth the recommended API for supplying input data.
In general, when taking numeric values from the outside world, the range of values needs to be normalized; if you plug in raw numbers like length, mass, velocity, date or time, the resulting problem will be ill-conditioned. It's necessary to check the dynamic range of the values and normalize them to the range (0, 1) or (-1, 1).
This can of course be done in raw Python. However, DataSet provides a number of data transformation features and encourages their use, on the theory that the resulting code will be not only easier to maintain but also faster. That suggests there should also be a built-in feature for normalization.
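For concreteness, here is roughly what I mean by doing it in raw Python: a minimal numpy sketch of column-wise min-max scaling (the array contents are made up purely for illustration):

import numpy as np

# hypothetical raw feature matrix: one row per example, one column per feature
raw = np.array([[170.0, 62.0],
                [180.0, 81.0],
                [160.0, 55.0]])

# column-wise min-max scaling into the range (0, 1)
lo, hi = raw.min(axis=0), raw.max(axis=0)
normalized = (raw - lo) / (hi - lo)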
Looking over the documentation at https://www.tensorflow.org/programmers_guide/datasets, however, I don't see any mention of such a feature. Am I missing something? What is the recommended way to do this?
Upvotes: 4
Views: 1776
Reputation: 53758
My understanding of the main idea behind TensorFlow datasets tells me that complex pre-processing is not directly applicable, because tf.data.Dataset is specifically designed to stream very large amounts of data, more precisely tensors:
A Dataset can be used to represent an input pipeline as a collection of elements (nested structures of tensors) and a "logical plan" of transformations that act on those elements.
The fact that tf.data.Dataset operates on tensors means that obtaining any particular statistic over the data, such as the min or max, requires a complete tf.Session and at least one run through the whole pipeline. The following sample lines:
iterator = dataset.make_one_shot_iterator()  # builds an iterator over the streamed elements
batch_x, batch_y = iterator.get_next()       # tensors that yield the next batch when run
... which are designed to provide the next batch fast, no matter the size of the dataset, would stop the world until the first batch is ready if the dataset were responsible for the pre-processing. That's why the "logical plan" includes only local transformations, which ensures the data can be streamed and, in addition, allows the transformations to be performed in parallel.
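As an illustration of what still fits that "logical plan", here is a minimal sketch of a purely local per-element transformation via map (the statistics here are made-up constants supplied from outside the pipeline, and num_parallel_calls=4 is an arbitrary choice):

import numpy as np
import tensorflow as tf

# toy in-memory data; in a real pipeline the elements would come from files
features_np = np.array([[170.0, 62.0], [180.0, 81.0]], dtype=np.float32)
labels_np = np.array([0, 1], dtype=np.int64)

# hypothetical pre-computed statistics -- they cannot be derived inside the pipeline itself
feature_min = tf.constant([160.0, 55.0])
feature_max = tf.constant([180.0, 81.0])

def scale(features, label):
    # purely local: uses only the current element plus fixed constants
    return (features - feature_min) / (feature_max - feature_min), label

dataset = tf.data.Dataset.from_tensor_slices((features_np, labels_np))
dataset = dataset.map(scale, num_parallel_calls=4).batch(2)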
This doesn't mean it's impossible to implement the normalization with tf.data.Dataset; I just feel it was never designed to do so and, as a result, it will look ugly (though I can't be absolutely sure of that). However, note that batch normalization fits into this picture perfectly, and it's one of the "nice" options I see. Another option is to do simple pre-processing in numpy and feed the result of that into tf.data.Dataset.from_tensor_slices. This doesn't make the pipeline much more complicated, but it doesn't restrict you from using tf.data.Dataset at all.
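A rough sketch of that second option, assuming the data fits in memory (the arrays and the (0, 1) scaling are only illustrative):

import numpy as np
import tensorflow as tf

# hypothetical in-memory data
features = np.random.uniform(0.0, 500.0, size=(1000, 3)).astype(np.float32)
labels = np.random.randint(0, 2, size=(1000,)).astype(np.int64)

# global statistics are cheap to compute in numpy before the pipeline is built
lo, hi = features.min(axis=0), features.max(axis=0)
features = (features - lo) / (hi - lo)  # now in (0, 1)

# the already-normalized arrays feed the usual tf.data pipeline
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(1000).batch(32)

iterator = dataset.make_one_shot_iterator()
batch_x, batch_y = iterator.get_next()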
Upvotes: 2