Reputation: 13
In the example below (TensorFlow 2.0), we have a dummy TensorFlow dataset with three elements. We map a function (replace_with_float) onto it that returns a randomly generated value in two copies. As expected, when we take elements from the dataset, the first and second coordinates have the same value.
Now we create two "slice" datasets from the first and second coordinates, respectively, and zip the two datasets back together. The slicing and zipping operations seem to be inverses of each other, so I would expect the resulting dataset to be equivalent to the previous one. However, as we see, the first and second coordinates are now different randomly generated values.
Maybe even more interestingly, if we zip the "same" dataset with itself via
df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: x)))
the two coordinates will also have different values.
How can this behavior be explained? Perhaps two different graphs are constructed for the two datasets to be zipped and they are run independently?
import tensorflow as tf

def replace_with_float(element):
    # Generate one random scalar and return it in both coordinates
    rand = tf.random.uniform([])
    return (rand, rand)

df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)

print('Before zipping: ')
for x in df:
    print(x[0].numpy(), x[1].numpy())

df = tf.data.Dataset.zip((df.map(lambda x, y: x), df.map(lambda x, y: y)))

print('After zipping: ')
for x in df:
    print(x[0].numpy(), x[1].numpy())
Sample output:
Before zipping:
0.08801079 0.08801079
0.638958 0.638958
0.800568 0.800568
After zipping:
0.9676769 0.23045003
0.91056764 0.6551999
0.4647777 0.6758332
Upvotes: 1
Views: 236
Reputation: 975
The short answer is that datasets don't cache intermediate values between full iterations, unless you explicitly request that using df.cache(), and they don't deduplicate common inputs either. So in the second loop, the entire pipeline runs again.
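To see that first effect in isolation, here is a minimal sketch (assuming TF 2.x eager execution; the names are mine, not from the question) that iterates the same mapped dataset twice; the second pass re-runs the map and produces fresh random values:
import tensorflow as tf

# A dataset analogous to the question's df: three elements,
# each mapped to a fresh random scalar
ds = tf.data.Dataset.from_tensor_slices([0, 0, 0])
ds = ds.map(lambda _: tf.random.uniform([]))

print([x.numpy() for x in ds])  # three random values
print([x.numpy() for x in ds])  # three *different* values: the map ran again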
Similarly, in the second instance, the two df.map calls cause df to run twice.
Adding a tf.print helps explain what happens:
def replace_with_float(element):
    rand = tf.random.uniform([])
    tf.print('replacing', element, 'with', rand)
    return (rand, rand)
I've also pulled the lambdas onto separate lines to avoid the AutoGraph warning:
first = lambda x, y: x
second = lambda x, y: y
df = tf.data.Dataset.zip((df.map(first), df.map(second)))
Before zipping:
replacing 0 with 0.624579549
0.62457955 0.62457955
replacing 0 with 0.471772075
0.47177207 0.47177207
replacing 0 with 0.394005418
0.39400542 0.39400542
After zipping:
replacing 0 with 0.537954807
replacing 0 with 0.558757305
0.5379548 0.5587573
replacing 0 with 0.839109302
replacing 0 with 0.878996611
0.8391093 0.8789966
replacing 0 with 0.0165234804
replacing 0 with 0.534951568
0.01652348 0.53495157
To avoid the duplicate input problem, you can use a single map call:
swap = lambda x, y: (y, x)
df = df.map(swap)
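As a quick sanity check (a sketch reusing the question's df, with the tf.print version of replace_with_float upstream), each element should now log only one 'replacing' line, and the two coordinates remain equal, just swapped:
for x in df:
    # replace_with_float (and its tf.print) fires once per element,
    # so both coordinates come from the same random draw
    print(x[0].numpy(), x[1].numpy())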
Or you can use df = df.cache() to avoid both effects:
df = df.map(replace_with_float)
df = df.cache()
Before zipping:
replacing 0 with 0.728474379
0.7284744 0.7284744
replacing 0 with 0.419658661
0.41965866 0.41965866
replacing 0 with 0.911524653
0.91152465 0.91152465
After zipping:
0.7284744 0.7284744
0.41965866 0.41965866
0.91152465 0.91152465
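One caveat worth noting (my reading of the output above): the cache is filled by the first full iteration, here the "Before zipping" loop, so the zipped pipeline afterwards reads stored elements instead of re-running replace_with_float. The ordering matters; a sketch of the full cached pipeline, reusing the question's names:
df = tf.data.Dataset.from_tensor_slices([0, 0, 0])
df = df.map(replace_with_float)
df = df.cache()  # results are stored after the first complete pass

first = lambda x, y: x
second = lambda x, y: y
# Both branches below read the cached (rand, rand) pairs,
# so the random values are generated only once
df = tf.data.Dataset.zip((df.map(first), df.map(second)))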
Upvotes: 1