Crystina

Reputation: 1230

Is tf.data.Dataset.shuffle = tf.data.Dataset.prefetch + shuffling internally?

According to the documentation of tf.data.Dataset.shuffle, it fills a buffer of size k and then shuffles within that buffer. However, I don't want the order of the data to be changed; I only want it to be buffered. Then I found tf.data.Dataset.prefetch, whose documentation says "This allows later elements to be prepared while the current element is being processed."
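For concreteness, a minimal sketch of the difference as I understand it (assuming TF 2.x; the toy dataset and buffer sizes are just placeholders):

    import tensorflow as tf

    ds = tf.data.Dataset.range(10)

    # shuffle(buffer_size) fills a buffer of `buffer_size` elements and samples
    # from it, so the element order changes.
    shuffled = ds.shuffle(buffer_size=4)

    # prefetch(buffer_size) keeps up to `buffer_size` elements prepared in the
    # background while the current element is being consumed; order is preserved.
    prefetched = ds.prefetch(buffer_size=4)

    print(list(shuffled.as_numpy_iterator()))    # reordered, e.g. [2, 0, 3, 1, ...]
    print(list(prefetched.as_numpy_iterator()))  # [0, 1, 2, ..., 9]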

From the description I guess prefetch is what I want (i.e. pre-loading the data while the previous data are being used in training), but while looking into the code of tf.data.Dataset.shuffle to see whether it actually calls prefetch, I got stuck at the lines below and cannot find where shuffle_dataset_v3 is defined.

      variant_tensor = gen_dataset_ops.shuffle_dataset_v3(
          input_dataset._variant_tensor,  # pylint: disable=protected-access
          buffer_size=self._buffer_size,
          seed=self._seed,
          seed2=self._seed2,
          seed_generator=gen_dataset_ops.dummy_seed_generator(),
          reshuffle_each_iteration=self._reshuffle_each_iteration,
          **self._flat_structure)

My main question is whether prefetch is a replacement for shuffle when all I want is the buffering, and it would also be nice if someone could point me to where shuffle_dataset_v3 is implemented.

Upvotes: 0

Views: 414

Answers (1)

Fan Luo

Reputation: 106

  1. Yes. prefetch is what you want for buffering data without changing its order (see the first sketch below).

  2. gen_dataset_ops and the other gen_xxx_ops modules are not included in the source tree because they are generated automatically by bazel to wrap the C++ implementations for use in Python. You should be able to find these gen_xxx_ops files in your local installation, for example ${PYTHON_ROOT}/site-packages/tensorflow/python/ops/gen_dataset_ops.py; the second sketch below shows how to print that location.
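A minimal sketch of using prefetch in a training pipeline to keep data buffered without reordering it (assuming a recent TF 2.x; batch and buffer sizes are arbitrary):

    import tensorflow as tf

    ds = tf.data.Dataset.range(1000)

    # prefetch overlaps data preparation with consumption: while the current
    # batch is being used in the training step, later elements are prepared in
    # the background. The order of elements is unchanged.
    # tf.data.AUTOTUNE lets the runtime choose the buffer size dynamically.
    ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)

    for batch in ds:
        pass  # training step goes here; the next batch is already being prepared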
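And a quick way to locate the generated wrapper in your own installation (the exact paths will differ between environments):

    from tensorflow.python.ops import gen_dataset_ops

    # Prints the path of the generated Python wrapper, e.g.
    # .../site-packages/tensorflow/python/ops/gen_dataset_ops.py
    print(gen_dataset_ops.__file__)

    # The wrapper only dispatches to the registered C++ kernel ("ShuffleDatasetV3");
    # the actual shuffling logic lives in TensorFlow's C++ sources, e.g.
    # tensorflow/core/kernels/data/shuffle_dataset_op.cc in the repository.
    print(hasattr(gen_dataset_ops, "shuffle_dataset_v3"))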

Upvotes: 1
