Padded_batch with pre- or post-padding option

Question

I have a dataset of variable-length sequences (a tensorflow TFRecord dataset) to feed an LSTM network and I want to try and compare pre- and post-padding in the batches, but current padded_batch function only pads at the sequences end. I know that we have tf.keras.preprocessing.sequence.pad_sequences function in API but I don't know how to apply this function to dataset batch processor. The padded_batch function in tensorflow does both padding and batching, and it will find the required paddding size per batch dynamically. How can I implement this myself? My code right now is like this, and I am reading multiple TFRecord files and interleave them to make my mixed dataset:

featuresDict = {'data': tf.FixedLenFeature([], dtype=tf.string),
                'rows': tf.FixedLenFeature([], dtype=tf.int64),
                'label': tf.FixedLenFeature([], dtype=tf.int64)
               }

def parse_tfrecord(example):
    features = tf.parse_single_example(example, featuresDict)
    label = tf.one_hot(features['label'],N)
    rows = features['rows']
    data = tf.decode_raw(features['data'], tf.int64)
    data = tf.reshape(data, (rows,num_features)
    return data, label

def read_datasets(pattern, numFiles, numEpochs=None, batchSize=None):
    files = tf.data.Dataset.list_files(pattern)

    def _parse(x):
        x = tf.data.TFRecordDataset(x, compression_type='GZIP')
        return x

    dataset = files.interleave(_parse, cycle_length=numFiles, block_length=1).map(parse_tfrecord)
    padded_shapes = (tf.TensorShape([None, num_features]), tf.TensorShape([N,])))
    dataset = dataset.padded_batch(batchSize, padded_shapes)
    dataset = dataset.prefetch(buffer_size=batchSize)
    dataset = dataset.repeat(numEpochs)
    return dataset

Padded_batch with pre- or post-padding option

Answers (1)

Related Questions