Reputation: 3840
I want to train a network using TensorFlow, based on features from a time signal. The data is split into E three-second epochs with F features per epoch, so it has the form
Epoch | Feature 1 | Feature 2 | ... | Feature F
------+-----------+-----------+-----+----------
  1   |    ..     |    ..     | ... |    ..
 ...  |    ..     |    ..     | ... |    ..
  E   |    ..     |    ..     | ... |    ..
To load the data into TensorFlow I am trying to follow the CIFAR example and use tf.FixedLengthRecordReader. I have therefore saved the data to a binary file of float32 values: first the label for the first epoch, followed by that epoch's F features, then the second epoch, and so on.
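For reference, the file layout described above could be produced with a small sketch like this, using Python's stdlib struct module (the filename and the example values are made up; everything, including the label, is packed as little-endian float32, as described above):

```python
import struct

NUM_FEATURES = 10

# Hypothetical example data: one (label, features) pair per epoch.
records = [
    (1.0, [0.1 * i for i in range(NUM_FEATURES)]),
    (0.0, [0.2 * i for i in range(NUM_FEATURES)]),
]

with open('data.bin', 'wb') as f:
    for label, features in records:
        # Little-endian float32: the label first, then the F features.
        f.write(struct.pack('<%df' % (1 + NUM_FEATURES), label, *features))
```

Each record is then exactly 4 * (1 + NUM_FEATURES) bytes long, which is the record size the reader below is configured with.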
Reading this back into TensorFlow is a challenge for me, however. Here is my code:
def read_data_file(file_queue):
    class DataRecord(object):
        pass

    result = DataRecord()

    # 1 float32 as label => 4 bytes
    label_bytes = 4
    # NUM_FEATURES float32 values => 4 * NUM_FEATURES bytes
    features_bytes = 4 * NUM_FEATURES

    # Create the reader with the total record size in bytes
    reader = tf.FixedLengthRecordReader(record_bytes=label_bytes + features_bytes)
    # Perform the read operation
    result.key, value = reader.read(file_queue)

    # Decode the result from bytes to float32
    value_bytes = tf.decode_raw(value, tf.float32, little_endian=True)

    # Cast label to int for later
    result.label = tf.cast(tf.slice(value_bytes, [0], [label_bytes]), tf.int32)
    # Cast features to float32
    result.features = tf.cast(tf.slice(value_bytes, [label_bytes], [features_bytes]),
                              tf.float32)

    print('>>>>>>>>>>>>>>>>>>>>>>>>>>>')
    print('%s' % result.label)
    print('%s' % result.features)
    print('>>>>>>>>>>>>>>>>>>>>>>>>>>>')
Print output was:
Tensor("Cast:0", shape=TensorShape([Dimension(4)]), dtype=int32)
Tensor("Slice_1:0", shape=TensorShape([Dimension(40)]), dtype=float32)
This surprises me: since I have decoded the values to float32, I expected the dimensions to be 1 and 10 respectively (the actual element counts), but they are 4 and 40, which correspond to the byte lengths.
How come?
Upvotes: 1
Views: 2287
Reputation: 126174
I think the issue stems from the fact that tf.decode_raw(value, tf.float32, little_endian=True) returns a vector of tf.float32 elements rather than a vector of bytes. The slice size for extracting the features should therefore be specified as a count of floating-point values (i.e. NUM_FEATURES) rather than a count of bytes (features_bytes).
However, there's a slight wrinkle: your label is an integer, while the rest of the vector contains floating-point values. TensorFlow doesn't have many facilities for casting between binary representations (other than tf.decode_raw()), so you'll have to decode the string twice into different types:
# Decode the result from bytes to int32
value_as_ints = tf.decode_raw(value, tf.int32, little_endian=True)
result.label = value_as_ints[0]
# Decode the result from bytes to float32
value_as_floats = tf.decode_raw(value, tf.float32, little_endian=True)
result.features = value_as_floats[1:1+NUM_FEATURES]
Note that this only works because sizeof(tf.int32) == sizeof(tf.float32), which wouldn't be true in general. Some more string manipulation tools would be useful for slicing out the appropriate substrings of the raw value in the more general case. Hopefully this should be enough to get you going, though.
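The two-pass decoding above can be sanity-checked outside TensorFlow. Here is a small sketch of the same byte-level logic using NumPy, whose numpy.frombuffer reinterprets a buffer the way tf.decode_raw does (the record contents are a made-up example; the layout is an int32 label followed by float32 features):

```python
import struct

import numpy as np

NUM_FEATURES = 10

# Build one raw record: an int32 label followed by NUM_FEATURES float32s,
# all little-endian -- the layout the two decode_raw() calls assume.
raw = struct.pack('<i', 3) + struct.pack('<%df' % NUM_FEATURES,
                                         *[0.5 * i for i in range(NUM_FEATURES)])

# First pass: reinterpret the whole record as int32s; element 0 is the label.
label = np.frombuffer(raw, dtype='<i4')[0]

# Second pass: reinterpret the same bytes as float32s and skip the first
# element, which occupies the label's 4 bytes (sizeof(int32) == sizeof(float32)).
features = np.frombuffer(raw, dtype='<f4')[1:1 + NUM_FEATURES]

print(label)           # the int32 label, decoded from the first 4 bytes
print(features.shape)  # (NUM_FEATURES,)
```

If the label were, say, an int64 (8 bytes), this skip-one-element trick would break, and you would have to slice the byte string itself before decoding each part.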
Upvotes: 2