desertnaut

Reputation: 60338

Tensorflow Datasets with string inputs do not preserve data type

All the reproducible code below is run in Google Colab with TF 2.2.0-rc2.

Adapting the simple example from the documentation to create a dataset from a plain Python list:

import numpy as np
import tensorflow as tf
tf.__version__
# '2.2.0-rc2'
np.version.version
# '1.18.2'

dataset1 = tf.data.Dataset.from_tensor_slices([1, 2, 3]) 
for element in dataset1: 
  print(element) 
  print(type(element.numpy()))

we get the result

tf.Tensor(1, shape=(), dtype=int32)
<class 'numpy.int32'>
tf.Tensor(2, shape=(), dtype=int32)
<class 'numpy.int32'>
tf.Tensor(3, shape=(), dtype=int32)
<class 'numpy.int32'>

where all data types are int32, as expected.

But changing this simple example to feed a list of strings instead of integers:

dataset2 = tf.data.Dataset.from_tensor_slices(['1', '2', '3']) 
for element in dataset2: 
  print(element) 
  print(type(element.numpy()))

gives the result

tf.Tensor(b'1', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'2', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'3', shape=(), dtype=string)
<class 'bytes'>

where, surprisingly, even though the tensors themselves are of dtype=string, their evaluated values (.numpy()) are of type bytes.
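For completeness, a quick check of the element spec confirms that the declared dtype is indeed tf.string, even though the evaluated values come back as bytes:

dataset2.element_spec
# TensorSpec(shape=(), dtype=tf.string, name=None)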

This behavior is not confined to the .from_tensor_slices method; here is the situation with .list_files (the following snippet runs as-is in a fresh Colab notebook):

disc_data = tf.data.Dataset.list_files('sample_data/*.csv') # 4 csv files
for element in disc_data: 
  print(element) 
  print(type(element.numpy()))

the result being:

tf.Tensor(b'sample_data/california_housing_test.csv', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'sample_data/mnist_train_small.csv', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'sample_data/california_housing_train.csv', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'sample_data/mnist_test.csv', shape=(), dtype=string)
<class 'bytes'>

where again, the file names in the evaluated tensors are returned as bytes instead of str, even though the tensors themselves are of dtype=string.

Similar behavior is also observed with the .from_generator method (see the sketch below).
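Here is a minimal sketch of that case, passing output_types=tf.string and yielding plain Python strings from the generator:

def gen():
  yield '1'
  yield '2'

dataset_gen = tf.data.Dataset.from_generator(gen, output_types=tf.string)
for element in dataset_gen:
  print(element)                # tf.Tensor(b'1', shape=(), dtype=string), etc.
  print(type(element.numpy()))  # <class 'bytes'> again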

A final demonstration: as shown in the documentation of the .as_numpy_iterator method, the following equality check evaluates to True:

dataset3 = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]), 
                                               'b': [5, 6]}) 

list(dataset3.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5}, 
                                       {'a': (2, 4), 'b': 6}] 
# True

but if we change the elements of b to strings, the equality check now, surprisingly, evaluates to False!

dataset4 = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]), 
                                               'b': ['5', '6']})   # change elements of b to strings

list(dataset4.as_numpy_iterator()) == [{'a': (1, 3), 'b': '5'},   # here
                                       {'a': (2, 4), 'b': '6'}]   # also
# False

probably due to the different data types, since the values themselves are evidently identical.
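Indeed, repeating the comparison with bytes literals on the right-hand side (a minimal check along the same lines) evaluates to True:

list(dataset4.as_numpy_iterator()) == [{'a': (1, 3), 'b': b'5'},   # bytes now
                                       {'a': (2, 4), 'b': b'6'}]
# True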


I didn't stumble upon this behavior out of academic curiosity; I am trying to pass my data to TF Datasets using custom functions that read pairs of files from disk, of the form

f = ['filename1', 'filename2']

These custom functions work perfectly well on their own, but when mapped through a TF Dataset they give

RuntimeError: not a string

which, after the digging above, at least no longer seems unexplained, given that the returned data types are indeed bytes and not str.
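For illustration only, here is a minimal sketch of the pattern involved; read_pair below is a simplified, hypothetical stand-in for the actual functions, and the point is that inside tf.py_function the arguments arrive as tensors whose .numpy() values are bytes, so they apparently need an explicit decode before being used as file names:

def read_pair(fname1, fname2):
  # inside tf.py_function, .numpy() returns bytes, not str
  fname1 = fname1.numpy().decode('utf-8')
  fname2 = fname2.numpy().decode('utf-8')
  # ... the actual file reading would go here ...
  return fname1, fname2

pairs = tf.data.Dataset.from_tensor_slices((['filename1'], ['filename2']))
dataset5 = pairs.map(
  lambda a, b: tf.py_function(read_pair, [a, b], [tf.string, tf.string]))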

So, is this a bug (as it seems), or am I missing something here?

Upvotes: 4

Views: 1544

Answers (1)

nessuno

Reputation: 27052

This is a known behavior:

From: https://github.com/tensorflow/tensorflow/issues/5552#issuecomment-260455136

TensorFlow converts str to bytes in most places, including sess.run, and this is unlikely to change. The user is free to convert back, but unfortunately it's too large a change to add a unicode dtype to the core. Closing as won't fix for now.

I guess nothing has changed with TensorFlow 2.x: there are still places where strings are converted to bytes, and you have to take care of the conversion manually.
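For example, the conversion back to str is just an explicit decode of the evaluated value (a minimal sketch, using the dataset2 of your question and assuming UTF-8 contents):

for element in dataset2:
  value = element.numpy().decode('utf-8')  # bytes -> str
  print(type(value))
  # <class 'str'>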

From the issue you opened yourself, it seems that they treat this as a NumPy problem rather than a TensorFlow one.

Upvotes: 6
