DeltaIV

Reputation: 5646

Convert a Tensorflow dataset containing inputs and labels to two NumPy arrays

I'm using Tensorflow 2.9.1. I have a test_dataset object of class tf.data.Dataset, which stores both inputs and labels. The inputs are 4-dimensional Tensors, and the labels are 3-dimensional Tensors:

print(test_dataset)
<PrefetchDataset element_spec=(TensorSpec(shape=(64, 5, 548, 1), dtype=tf.float64, name=None), TensorSpec(shape=(64, 1, 1), dtype=tf.float64, name=None))>

The first dimension is the minibatch size. I need to convert this Tensorflow Dataset to two NumPy arrays, X_test containing the inputs, and y_test containing the labels, ordered in the same way. In other words, (X_test[0], y_test[0]) must correspond to the first sample from test_dataset. Since the first dimension of my tensors is the minibatch size, I want to concatenate the results along that first dimension.

How can I do that? I've seen two approaches:

np.concatenate

X_test = np.concatenate([x for x, _ in test_dataset], axis=0)
y_test = np.concatenate([y for _, y in test_dataset], axis=0)

But I don't like it for two reasons:

  1. it seems wasteful to iterate twice on the same dataset

  2. X_test and y_test are probably not ordered in the same way. If I run

    X_test = np.concatenate([x for x, _ in test_dataset], axis=0)
    X_test2 = np.concatenate([x for x, _ in test_dataset], axis=0)

X_test and X_test2 are different arrays, though of identical shape. I suspect the dataset is reshuffled after I iterate through it once. If so, X_test and y_test in my first snippet above won't be ordered in the same way either. How can I fix that?

tfds.as_numpy

tfds.as_numpy can be used to convert a Tensorflow Dataset to an iterable of NumPy arrays:

import tensorflow_datasets as tfds
np_test_dataset = tfds.as_numpy(test_dataset)
print(np_test_dataset)
<generator object _eager_dataset_iterator at 0x7fee81fd8b30>

However, I don't know how to proceed from here: how do I convert this iterable of NumPy arrays into two NumPy arrays of the right shapes?

Upvotes: 0

Views: 1051

Answers (1)

AlexK

Reputation: 3011

Instead of iterating over the dataset twice, you can unpack the dataset and concatenate the arrays inside the resulting tuples to get the final result.

zip(*ds) separates the dataset into two sequences (the X's and the y's) in a single pass, so each input stays paired with its label. X and y each become a tuple of arrays, which you then concatenate. The Python documentation for zip describes how zip(*iterables) works.
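To make the unpacking step concrete, here is a minimal sketch in plain Python. The list `pairs` is a hypothetical stand-in for a dataset that yields (input, label) tuples; it is not part of the original example.

```python
# Hypothetical stand-in for a dataset: an iterable of (input, label) pairs.
pairs = [("x0", "y0"), ("x1", "y1"), ("x2", "y2")]

# zip(*pairs) is equivalent to zip(("x0", "y0"), ("x1", "y1"), ("x2", "y2")),
# which groups all first items together and all second items together.
X, y = zip(*pairs)

print(X)  # ('x0', 'x1', 'x2')
print(y)  # ('y0', 'y1', 'y2')
```

Because both tuples are built from the same pass over `pairs`, the i-th entry of X always belongs with the i-th entry of y.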

Here is an example with mnist data:

import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train', as_supervised=True)
print(ds)
# <PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

X, y = zip(*ds)
print(type(X), type(y))
# <class 'tuple'> <class 'tuple'>
print(len(X), len(y))
# 60000 60000

X_arr = np.concatenate(X)
print(X_arr.shape)
# (1680000, 28, 1)

You would do the same concatenation with your y's. I am not showing it here because the labels in this dataset are scalars (shape ()), unlike your batched labels. np.concatenate is used here since you want to join arrays along the first existing axis.
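As a side note on unbatched elements like these, the sketch below contrasts np.concatenate with np.stack. It uses small made-up NumPy arrays (`images`, `labels`) as a stand-in for a few MNIST-like elements rather than loading the real dataset.

```python
import numpy as np

# Hypothetical stand-in for three unbatched MNIST-like elements.
images = tuple(np.zeros((28, 28, 1), dtype=np.uint8) for _ in range(3))
labels = tuple(np.int64(i) for i in range(3))

# np.concatenate joins along an existing axis, so the image rows get merged.
print(np.concatenate(images).shape)  # (84, 28, 1)

# np.stack adds a new leading axis, keeping one entry per sample.
print(np.stack(images).shape)  # (3, 28, 28, 1)
print(np.stack(labels).shape)  # (3,)
```

For batched elements such as those in the question, the leading axis already exists, so np.concatenate along axis 0 is the appropriate call.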

If needed, unpacking could also be done on the iterable created by the tfds.as_numpy() method:

X, y = zip(*tfds.as_numpy(ds))
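For the batched case in the question, the same idea can be sketched as follows. The generator `fake_batches` is a hypothetical stand-in for a single pass over test_dataset (or tfds.as_numpy(test_dataset)); it yields smaller arrays with the same layout as the question's, and the shapes and names are assumptions for illustration only.

```python
import numpy as np

def fake_batches(num_batches=3, batch_size=4):
    """Hypothetical stand-in for one pass over a batched dataset.

    Yields (inputs, labels) pairs laid out like the question's tensors,
    with the minibatch size as the first dimension.
    """
    rng = np.random.default_rng(0)
    for _ in range(num_batches):
        x = rng.normal(size=(batch_size, 5, 8, 1))
        # Label each sample with the sum of its input so pairing can be checked.
        y = x.sum(axis=(1, 2, 3)).reshape(batch_size, 1, 1)
        yield x, y

# One pass only: each (x, y) pair is pulled together, so the order stays aligned.
X_batches, y_batches = zip(*fake_batches())

# Join the batches along the existing first (minibatch) axis.
X_test = np.concatenate(X_batches, axis=0)
y_test = np.concatenate(y_batches, axis=0)

print(X_test.shape)  # (12, 5, 8, 1)
print(y_test.shape)  # (12, 1, 1)

# Every label still matches its input after concatenation.
print(np.allclose(y_test[:, 0, 0], X_test.sum(axis=(1, 2, 3))))  # True
```

Since the dataset is iterated once, any reshuffling between iterations cannot misalign X_test and y_test.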

Upvotes: 1
