Convert a TensorFlow dataset containing inputs and labels to two NumPy arrays

Question

I'm using TensorFlow 2.9.1. I have a test_dataset object of class tf.data.Dataset, which stores both inputs and labels. The inputs are 4-dimensional tensors, and the labels are 3-dimensional tensors:

print(test_dataset)

The first dimension is the minibatch size. I need to convert this TensorFlow dataset to two NumPy arrays: X_test, containing the inputs, and y_test, containing the labels, both ordered in the same way. In other words, (X_test[0], y_test[0]) must correspond to the first sample from test_dataset. Since the first dimension of my tensors is the minibatch size, I want to concatenate the results along that first dimension.

How can I do that? I've seen two approaches:

np.concatenate

X_test = np.concatenate([x for x, _ in test_dataset], axis=0)
y_test = np.concatenate([y for _, y in test_dataset], axis=0)

But I don't like it for two reasons:

1. It seems wasteful to iterate twice over the same dataset.
2. X_test and y_test are probably not ordered in the same way. If I run

X_test = np.concatenate([x for x, _ in test_dataset], axis=0)
X_test2 = np.concatenate([x for x, _ in test_dataset], axis=0)

X_test and X_test2 are different arrays, though of identical shape. I suspect the dataset is reshuffled each time I iterate through it. If so, X_test and y_test in my snippet above won't be ordered in the same way either, because they come from two separate passes. How can I fix that?

tfds.as_numpy

tfds.as_numpy can be used to convert a TensorFlow dataset to an iterable of NumPy arrays:

import tensorflow_datasets as tfds
np_test_dataset = tfds.as_numpy(test_dataset)
print(np_test_dataset)

However, I don't know how to proceed from here: how do I convert this iterable of NumPy arrays into two NumPy arrays of the right shapes?
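
Both approaches above boil down to walking an iterable of (inputs, labels) minibatches, so one possible way forward is a single pass that collects both halves of each pair together. The sketch below is only an illustration under assumptions: dataset_to_arrays is a hypothetical helper name, and fake_batches is a stand-in with made-up shapes used instead of the real test_dataset. Because each x and y is taken from the same element in the same pass, the pairing stays aligned even if the dataset reshuffles between iterations.

```python
import numpy as np


def dataset_to_arrays(batches):
    """Concatenate an iterable of (inputs, labels) minibatches in one pass.

    Hypothetical helper: `batches` can be any iterable yielding pairs,
    for example the result of tfds.as_numpy(test_dataset).
    """
    xs, ys = [], []
    for x, y in batches:
        # x and y come from the same element, so they stay paired.
        # np.asarray returns NumPy arrays unchanged and also converts
        # other array-like objects.
        xs.append(np.asarray(x))
        ys.append(np.asarray(y))
    # Join the minibatches along the first (batch) dimension.
    return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)


# Stand-in data with assumed shapes: 3 minibatches of 4 samples each,
# 4-D inputs and 3-D labels as described in the question.
rng = np.random.default_rng(0)
fake_batches = [
    (rng.normal(size=(4, 8, 8, 2)), rng.normal(size=(4, 8, 8)))
    for _ in range(3)
]

X_test, y_test = dataset_to_arrays(fake_batches)
print(X_test.shape, y_test.shape)
```

With the real data, the call would presumably look like X_test, y_test = dataset_to_arrays(tfds.as_numpy(test_dataset)). Note that if the dataset does reshuffle on every iteration, two separate calls can still return the samples in different orders; only the correspondence between X_test[i] and y_test[i] within one call is preserved.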
