delyeet

Reputation: 168

How to Convert Reading of SequenceExample Objects from tf.python_io.tf_record_iterator to tf.data.TFRecordDataset

I have a dataset in the TFRecords format, and I am trying to switch from reading it with tf.python_io.tf_record_iterator to reading it with tf.data.TFRecordDataset.

Outside of tf.python_io.tf_record_iterator being deprecated, the main reason for doing this is that I would like to be able to use tf.data.Dataset objects.

Within the TFRecords file, each entry is a SequenceExample, specifically tensorflow.core.example.example_pb2.SequenceExample.

Currently I am reading out each SequenceExample via this function:

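import tensorflow as tf  # TF 1.x API


def read_records(record_path):
    """Read every SequenceExample in a TFRecords file into a list."""
    records = []
    record_iterator = tf.python_io.tf_record_iterator(path=record_path)

    # Each item yielded by the iterator is the raw serialized bytes of one record.
    for string_record in record_iterator:
        example = tf.train.SequenceExample()
        example.ParseFromString(string_record)
        records.append(example)

    return records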

Printing out a record gives me this kind of structure (truncated due to length):

context {
  feature {
    key: "framecount"
    value {
      int64_list {
        value: 10
      }
    }
  }
  feature {
    key: "label"
    value {
      int64_list {
        value: 1
      }
    }
  }
}
feature_lists {
  feature_list {
    key: "positions"
    value {
      feature {
        bytes_list {
          value: "\221\2206?\200dL?\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
        }
      }
    }
  }
}

Now if I attempt to do this with tf.data.TFRecordDataset, my function is:

def reader(file_path):
    dataset = tf.data.TFRecordDataset(file_path)
    for record in dataset:
        tf.io.parse_sequence_example(record)

    return dataset

I get a ValueError stating that I have not supplied context or sequence features. That is true: I did not pass any, because the record already contains those values. (I have also tried following the same flow as the first function, creating a new SequenceExample and parsing each record into it, but the data that TFRecordDataset outputs seems to be different from what the old record iterator produced.)

Given this, how would I properly generate my SequenceExample? I could technically give the parser feature parameters to work with, but that seems counterintuitive, especially since the data is already in the record.

Alternatively (though this would be more of a band-aid fix), how could I convert the list from the first function into a TensorFlow Dataset object?

Upvotes: 0

Views: 1106

Answers (1)

delyeet

Reputation: 168

Okay, so this one was a little tricky...

It seems that tf.python_io.tf_record_iterator yields each record as raw serialized bytes, which SequenceExample.FromString() (or ParseFromString()) can parse directly. TFRecordDataset, on the other hand, yields each record as a scalar string tensor, so the bytes have to be pulled out of the tensor first.
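That difference can be checked directly. The following is a minimal sketch, assuming TensorFlow 2.x with eager execution enabled; the temporary file and the single "label" feature are made up for the demo and are not from my real dataset:

```python
import os
import tempfile

import tensorflow as tf  # assumes TensorFlow 2.x (eager execution on by default)

# Write one throwaway SequenceExample so the snippet is self-contained.
example = tf.train.SequenceExample()
example.context.feature["label"].int64_list.value.append(1)
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")  # hypothetical file
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

for record in tf.data.TFRecordDataset(path):
    record_dtype = record.dtype  # each element is a scalar tf.string tensor
    raw = record.numpy()         # in eager mode this gives plain Python bytes
    print(record_dtype, type(raw).__name__)
```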

Since my intent was to be able to pass my datapoints to a model via the built-in generator ability of the Dataset object, I can get around it by wrapping the output of TFRecordDataset. Specifically, I can use SequenceExample.FromString(datapoint.numpy()) to get the desired output.

This is a little wordy, so my solution function follows:

def reader(file_path):
    dataset = tf.data.TFRecordDataset(file_path)
    for record in dataset:
        # record is a scalar string tensor; .numpy() (eager mode) gives the raw bytes
        record = tf.train.SequenceExample.FromString(record.numpy())
        yield record

This is a direct modification of the second function in my question.
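If the goal is to stay inside tf.data rather than get protobuf objects back, another option is to tell the parser which features to expect and map it over the dataset. Below is a minimal sketch, assuming TensorFlow 2.x and records shaped like the one printed in the question (int64 context features "framecount" and "label", and a bytes feature list "positions"). The helper make_example, the names CONTEXT_SPEC and SEQUENCE_SPEC, and the temporary file are hypothetical and exist only to make the demo runnable:

```python
import os
import tempfile

import tensorflow as tf  # assumes TensorFlow 2.x


def make_example(framecount, label, positions):
    """Hypothetical helper: build a SequenceExample like the one in the question."""
    example = tf.train.SequenceExample()
    example.context.feature["framecount"].int64_list.value.append(framecount)
    example.context.feature["label"].int64_list.value.append(label)
    feature_list = example.feature_lists.feature_list["positions"]
    for frame in positions:
        feature_list.feature.add().bytes_list.value.append(frame)
    return example


# Feature specs describing what the parser should extract (assumed layout).
CONTEXT_SPEC = {
    "framecount": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
SEQUENCE_SPEC = {
    "positions": tf.io.FixedLenSequenceFeature([], tf.string),
}


def parse(serialized):
    """Parse one serialized SequenceExample into (context, sequence) dicts of tensors."""
    return tf.io.parse_single_sequence_example(
        serialized,
        context_features=CONTEXT_SPEC,
        sequence_features=SEQUENCE_SPEC,
    )


# Write a single demo record to a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_example(2, 1, [b"ab", b"cd"]).SerializeToString())

dataset = tf.data.TFRecordDataset(path).map(parse)
for context, sequence in dataset:
    print(int(context["label"]), sequence["positions"].numpy())
```

The result is still a tf.data.Dataset, so it can be batched, shuffled, or passed to a model, at the cost of having to spell out the feature spec up front. The "positions" values come back as raw byte strings and would still need decoding, which depends on how they were written.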

Upvotes: 1
