Balázs Strenner

Reputation: 13

Weird behavior of zipped tensorflow dataset with random tensors

In the example below (TensorFlow 2.0), we have a dummy TensorFlow dataset with three elements. We map a function (replace_with_float) over it that returns a randomly generated value in two copies. As expected, when we take elements from the dataset, the first and second coordinates have the same value.

Now, we create two "slice" datasets from the first and second coordinates, respectively, and zip the two datasets back together. The slicing and zipping operations seem to be inverses of each other, so I would expect the resulting dataset to be equivalent to the previous one. However, as we see, the first and second coordinates are now different randomly generated values.

Maybe even more interestingly, if we zip the "same" dataset with itself by df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: x))), the two coordinates will also have different values.
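For concreteness, a minimal sketch of that self-zip variant (it reuses the replace_with_float and dataset setup from the full example below; exact values vary from run to run):

import tensorflow as tf

def replace_with_float(element):
    rand = tf.random.uniform([])
    return (rand, rand)

df = tf.data.Dataset.from_tensor_slices([0, 0, 0]).map(replace_with_float)

# Zip the dataset with "itself", taking the first coordinate both times.
df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: x)))

for x in df:
    # The two coordinates typically differ here as well.
    print(x[0].numpy(), x[1].numpy())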

How can this behavior be explained? Perhaps two different graphs are constructed for the two datasets to be zipped and they are run independently?

import tensorflow as tf

def replace_with_float(element):
    # Draw one random scalar and return it in both coordinates.
    rand = tf.random.uniform([])
    return (rand, rand)

df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)
print('Before zipping: ')
for x in df:
    print(x[0].numpy(), x[1].numpy())

# Slice off each coordinate, then zip the two datasets back together.
df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: y)))

print('After zipping: ')
for x in df:
    print(x[0].numpy(), x[1].numpy())

Sample output:

Before zipping: 
0.08801079 0.08801079
0.638958 0.638958
0.800568 0.800568
After zipping: 
0.9676769 0.23045003
0.91056764 0.6551999
0.4647777 0.6758332

Upvotes: 1

Views: 236

Answers (1)

Dan Moldovan

Reputation: 975

The short answer is that datasets don't cache intermediate values between full iterations unless you explicitly request it with df.cache(), and they don't deduplicate common inputs either.

So in the second loop, the entire pipeline runs again. Similarly, in your second example, the two df.map calls cause df to run twice, once for each branch of the zip.
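To see the first effect in isolation, here is a minimal sketch (hypothetical, not part of your original example): iterating the same mapped dataset twice re-runs the whole pipeline, so each pass draws fresh random values.

import tensorflow as tf

df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(lambda x: tf.random.uniform([]))

# Two full passes over the same dataset object re-execute the map,
# so the two printed lists differ.
print([x.numpy() for x in df])
print([x.numpy() for x in df])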

Adding a tf.print helps explain what happens:

def replace_with_float(element):
    rand = tf.random.uniform([])
    tf.print('replacing', element, 'with', rand)
    return (rand, rand)

I've also pulled the lambdas onto separate lines to avoid the AutoGraph warning:

first = lambda x, y: x
second = lambda x, y: y

df = tf.data.Dataset.zip((df.map(first), df.map(second)))

Output:

Before zipping: 
replacing 0 with 0.624579549
0.62457955 0.62457955
replacing 0 with 0.471772075
0.47177207 0.47177207
replacing 0 with 0.394005418
0.39400542 0.39400542

After zipping: 
replacing 0 with 0.537954807
replacing 0 with 0.558757305
0.5379548 0.5587573
replacing 0 with 0.839109302
replacing 0 with 0.878996611
0.8391093 0.8789966
replacing 0 with 0.0165234804
replacing 0 with 0.534951568
0.01652348 0.53495157

To avoid the duplicate-input problem, you can use a single map call:

# A single map swaps the coordinates in one pass, so df runs only once
# and both values come from the same random draw.
swap = lambda x, y: (y, x)
df = df.map(swap)
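A quick sanity check (a sketch reusing replace_with_float from above; exact values vary from run to run): with a single map, both coordinates of each element come from the same random draw, so they match within an iteration.

import tensorflow as tf

def replace_with_float(element):
    rand = tf.random.uniform([])
    return (rand, rand)

df = tf.data.Dataset.from_tensor_slices([0, 0, 0]).map(replace_with_float)
df = df.map(lambda x, y: (y, x))  # one map call, one random draw per element

for x in df:
    # The pair is swapped but consistent: both values come from one draw.
    print(x[0].numpy(), x[1].numpy())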

Or you can use df = df.cache() to avoid both effects:

df = df.map(replace_with_float)
df = df.cache()

Output:

Before zipping: 
replacing 0 with 0.728474379
0.7284744 0.7284744
replacing 0 with 0.419658661
0.41965866 0.41965866
replacing 0 with 0.911524653
0.91152465 0.91152465

After zipping: 
0.7284744 0.7284744
0.41965866 0.41965866
0.91152465 0.91152465
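For completeness, a sketch of the whole cached pipeline (values differ between runs, but the pairs match; the first full pass is what populates the cache, so the zipped dataset reads the stored elements instead of re-running replace_with_float):

import tensorflow as tf

def replace_with_float(element):
    rand = tf.random.uniform([])
    return (rand, rand)

df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)
df = df.cache()

for x in df:  # first full pass computes the elements and fills the cache
    print(x[0].numpy(), x[1].numpy())

first = lambda x, y: x
second = lambda x, y: y
df = tf.data.Dataset.zip((df.map(first), df.map(second)))

for x in df:
    # Both zip branches read the cached values, so the pairs match.
    print(x[0].numpy(), x[1].numpy())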

Upvotes: 1
