Reputation: 543
I have a large dataset of numpy integers that I want to analyze on a GPU. The dataset is too large to fit into the GPU's main memory, so I am trying to serialize it into a TFRecord and then use the API to stream the record for processing. The code below is an example: it creates some fake data, serializes it into a TFRecord object, then reads the data back into memory in a TF session, parsing it with the map() function. My original data is non-homogeneous in the dimensions of the numpy arrays, though each is a 3D array with 10 as the length of the first axis. I recreated the heterogeneity using random numbers when I made the fake data. The idea is to store the size of each image as I serialize the data, so I can use it to restore each array to its original size. It is most definitely not working, for whatever reason. Here is the code:
import numpy as np
from skimage import io
from skimage.io import ImageCollection
import tensorflow as tf
import argparse
#A function for parsing TFRecords
def record_parser(record):
    keys_to_features = {
        'fil' : tf.FixedLenFeature([], tf.string),
        'm'   : tf.FixedLenFeature([], tf.int64),
        'n'   : tf.FixedLenFeature([], tf.int64)}
    parsed = tf.parse_single_example(record, keys_to_features)
    m = tf.cast(parsed['m'], tf.int32)
    n = tf.cast(parsed['n'], tf.int32)
    fil_shape = tf.stack([10, m, n])
    fil = tf.decode_raw(parsed['fil'], tf.float32)
    fil = tf.reshape(fil, fil_shape)
    return (fil, m, n)
#For writing and reading from the TFRecord
filename = "test.tfrecord"
if __name__ == "__main__":
    #Create the TFRecordWriter
    data_writer = tf.python_io.TFRecordWriter(filename)
    #Create some fake data
    files = []
    i_vals = np.random.randint(20, size=10)
    j_vals = np.random.randint(20, size=10)
    print(i_vals)
    print(j_vals)
    for x in range(5):
        files.append(np.random.rand(10, i_vals[x], j_vals[x]))
    #Serialize the fake data and record it as a TFRecord using the TFRecordWriter
    for fil in files:
        f, m, n = fil.shape
        fil_raw = fil.tostring()
        print("fil.shape: ", fil.shape)
        example = tf.train.Example(
            features = tf.train.Features(
                feature = {
                    'fil' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[fil_raw])),
                    'm' : tf.train.Feature(int64_list=tf.train.Int64List(value=[m])),
                    'n' : tf.train.Feature(int64_list=tf.train.Int64List(value=[n]))
                }
            )
        )
        data_writer.write(example.SerializeToString())
    data_writer.close()
    #Deserialize and report on the fake data
    sess = tf.Session()
    dataset = tf.data.TFRecordDataset([filename])
    dataset = dataset.map(record_parser)
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()
    sess.run(iterator.initializer)
    while True:
        try:
            fil, m, n = sess.run(next_element)
            print("fil.shape: ", fil.shape)
            print("M: ", m)
            print("N: ", n)
        except tf.errors.OutOfRangeError:
            break
The error gets thrown in the map() function:
MacBot$ python test.py
/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
[ 2 12 17 18 19 15 11 5 0 12]
[13 5 3 5 2 6 5 11 12 10]
fil.shape: (10, 2, 13)
fil.shape: (10, 12, 5)
fil.shape: (10, 17, 3)
fil.shape: (10, 18, 5)
fil.shape: (10, 19, 2)
2018-04-03 09:01:18.382870: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-03 09:01:18.420114: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at iterator_ops.cc:870 : Invalid argument: Input to reshape is a tensor with 520 values, but the requested shape has 260
[[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32](DecodeRaw, stack)]]
Traceback (most recent call last):
File "/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 520 values, but the requested shape has 260
[[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32](DecodeRaw, stack)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[10,?,?], [], []], output_types=[DT_FLOAT, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
Does anybody have a bit of insight into this problem? Your help is much appreciated! It may be worth noting that the data always seems to be twice the size I'm expecting it to be...
Upvotes: 1
Views: 1213
Reputation: 10474
It seems like you are writing the results of np.random.rand as they are. However, np.random.rand returns float64 values, while you are telling TensorFlow to interpret the bytes as float32. This is a mismatch, and it would explain why there are twice as many numbers as you expect (since there are twice as many bytes!).
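As a quick sanity check of the numbers in the traceback, here is a small sketch of the byte arithmetic (plain numpy, no TensorFlow needed):

import numpy as np

# The first array has shape (10, 2, 13), i.e. 10 * 2 * 13 = 260 elements.
# Stored as float64 that is 260 * 8 = 2080 bytes; decode_raw with tf.float32
# reinterprets those bytes as 2080 / 4 = 520 values -- exactly the mismatch
# reported ("tensor with 520 values, but the requested shape has 260").
print(np.dtype(np.float64).itemsize)  # 8
print(np.dtype(np.float32).itemsize)  # 4
print(260 * np.dtype(np.float64).itemsize // np.dtype(np.float32).itemsize)  # 520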
Try using files.append(np.random.rand(10, i_vals[x], j_vals[x]).astype(np.float32)) instead. Using float32 is recommended for CUDA anyway. You need to be careful with this in general: by default, numpy uses float64 for floating-point data in most places (while the default integer type may be int32, depending on the platform).
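For completeness, here is a minimal self-contained sketch of the round trip with consistent dtypes, using the same TF 1.x APIs as in the question (the file name and the shape (10, 4, 7) are just placeholders): cast the array to float32 before calling tostring(), and decode it back with tf.float32. Alternatively, you could leave the data as float64 on disk and pass tf.float64 to tf.decode_raw instead.

import numpy as np
import tensorflow as tf

filename = "dtype_check.tfrecord"

# Write one array, cast to float32 so the bytes match what the parser expects.
arr = np.random.rand(10, 4, 7).astype(np.float32)
writer = tf.python_io.TFRecordWriter(filename)
example = tf.train.Example(features=tf.train.Features(feature={
    'fil': tf.train.Feature(bytes_list=tf.train.BytesList(value=[arr.tostring()])),
    'm': tf.train.Feature(int64_list=tf.train.Int64List(value=[arr.shape[1]])),
    'n': tf.train.Feature(int64_list=tf.train.Int64List(value=[arr.shape[2]]))}))
writer.write(example.SerializeToString())
writer.close()

def parse(record):
    parsed = tf.parse_single_example(record, {
        'fil': tf.FixedLenFeature([], tf.string),
        'm': tf.FixedLenFeature([], tf.int64),
        'n': tf.FixedLenFeature([], tf.int64)})
    m = tf.cast(parsed['m'], tf.int32)
    n = tf.cast(parsed['n'], tf.int32)
    fil = tf.decode_raw(parsed['fil'], tf.float32)  # matches the written dtype
    return tf.reshape(fil, tf.stack([10, m, n])), m, n

dataset = tf.data.TFRecordDataset([filename]).map(parse)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
    fil, m, n = sess.run(next_element)
    print(fil.shape, m, n)  # expected: (10, 4, 7) 4 7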
Upvotes: 1