Reputation: 3840
I want to train a network using TensorFlow, based on features from a time signal. The data is split into E three-second epochs with F features per epoch, so it has the form
Epoch | Feature 1 | Feature 2 | ... | Feature F
------+-----------+-----------+-----+----------
  1   |    ..     |    ..     | ... |    ..
 ...  |    ..     |    ..     | ... |    ..
  E   |    ..     |    ..     | ... |    ..
To load the data into TensorFlow I am trying to follow the CIFAR example and use tf.FixedLengthRecordReader. I have therefore saved the data to a binary file of float32 values: first the label for the first epoch, followed by that epoch's F features, then the second epoch, and so on.
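For reference, the file layout described above could be produced with a small sketch like this, using Python's stdlib struct module (the filename and the example values are made up; everything, including the label, is packed as little-endian float32, as described above):

```python
import struct

NUM_FEATURES = 10

# Hypothetical example data: one (label, features) pair per epoch.
records = [
    (1.0, [0.1 * i for i in range(NUM_FEATURES)]),
    (0.0, [0.2 * i for i in range(NUM_FEATURES)]),
]

with open('data.bin', 'wb') as f:
    for label, features in records:
        # Little-endian float32: the label first, then the F features.
        f.write(struct.pack('<%df' % (1 + NUM_FEATURES), label, *features))
```

Each record is then exactly 4 * (1 + NUM_FEATURES) bytes long, which is the record size the reader below is configured with.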
Reading this back into TensorFlow is a challenge for me, however. Here is my code:
def read_data_file(file_queue):
    class DataRecord(object):
        pass

    result = DataRecord()

    # 1 float32 as label => 4 bytes
    label_bytes = 4
    # NUM_FEATURES float32 values => 4 * NUM_FEATURES bytes
    features_bytes = 4 * NUM_FEATURES

    # Create the reader with the total record size in bytes
    reader = tf.FixedLengthRecordReader(record_bytes=label_bytes + features_bytes)
    # Perform the read operation
    result.key, value = reader.read(file_queue)

    # Decode the result from bytes to float32
    value_bytes = tf.decode_raw(value, tf.float32, little_endian=True)

    # Cast label to int for later
    result.label = tf.cast(tf.slice(value_bytes, [0], [label_bytes]), tf.int32)
    # Cast features to float32
    result.features = tf.cast(tf.slice(value_bytes, [label_bytes], [features_bytes]),
                              tf.float32)

    print('>>>>>>>>>>>>>>>>>>>>>>>>>>>')
    print('%s' % result.label)
    print('%s' % result.features)
    print('>>>>>>>>>>>>>>>>>>>>>>>>>>>')
Print output was:
Tensor("Cast:0", shape=TensorShape([Dimension(4)]), dtype=int32)
Tensor("Slice_1:0", shape=TensorShape([Dimension(40)]), dtype=float32)
This surprises me: since I have decoded the values to float32, I expected the dimensions to be 1 and 10 respectively (the actual element counts), but they are 4 and 40, which correspond to the byte lengths.
How come?
Upvotes: 1
Views: 2287
Reputation: 126174
I think the issue stems from the fact that tf.decode_raw(value, tf.float32, little_endian=True) returns a vector of tf.float32 elements rather than a vector of bytes. The slice size for extracting the features should therefore be specified as a count of floating-point values (i.e. NUM_FEATURES) rather than a count of bytes (features_bytes).
However, there's a slight wrinkle: your label is an integer, while the rest of the vector contains floating-point values. TensorFlow doesn't have many facilities for casting between binary representations (other than tf.decode_raw()), so you'll have to decode the string twice into different types:
# Decode the result from bytes to int32
value_as_ints = tf.decode_raw(value, tf.int32, little_endian=True)
result.label = value_as_ints[0]
# Decode the result from bytes to float32
value_as_floats = tf.decode_raw(value, tf.float32, little_endian=True)
result.features = value_as_floats[1:1+NUM_FEATURES]
Note that this only works because sizeof(tf.int32) == sizeof(tf.float32), which wouldn't be true in general. Some more string manipulation tools would be useful for slicing out the appropriate substrings of the raw value in the more general case. Hopefully this should be enough to get you going, though.
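The two-pass decoding above can be sanity-checked outside TensorFlow. Here is a small sketch of the same byte-level logic using NumPy, whose numpy.frombuffer reinterprets a buffer the way tf.decode_raw does (the record contents are a made-up example; the layout is an int32 label followed by float32 features):

```python
import struct

import numpy as np

NUM_FEATURES = 10

# Build one raw record: an int32 label followed by NUM_FEATURES float32s,
# all little-endian -- the layout the two decode_raw() calls assume.
raw = struct.pack('<i', 3) + struct.pack('<%df' % NUM_FEATURES,
                                         *[0.5 * i for i in range(NUM_FEATURES)])

# First pass: reinterpret the whole record as int32s; element 0 is the label.
label = np.frombuffer(raw, dtype='<i4')[0]

# Second pass: reinterpret the same bytes as float32s and skip the first
# element, which occupies the label's 4 bytes (sizeof(int32) == sizeof(float32)).
features = np.frombuffer(raw, dtype='<f4')[1:1 + NUM_FEATURES]

print(label)           # the int32 label, decoded from the first 4 bytes
print(features.shape)  # (NUM_FEATURES,)
```

If the label were, say, an int64 (8 bytes), this skip-one-element trick would break, and you would have to slice the byte string itself before decoding each part.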
Upvotes: 2