Lukasz Tracewski

Reputation: 11377

Slow serialization of ndarray to TFRecord

I'd like to serialize a large numpy ndarray to TFRecord. The trouble is, the process is painfully slow. For an array of shape (1000000, 65) it takes almost a minute. Serializing the same array to other binary formats (HDF5, npy, Parquet, ...) takes less than a second. I am pretty sure there's a much faster way to serialize it, but I just can't figure it out.
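For reference, the gap can be reproduced with a timing snippet along these lines (the .npy path is arbitrary; timing write_tf_dataset below works the same way):

import time
import numpy as np

X = np.random.randn(1000000, 65)

t0 = time.perf_counter()
np.save('X.npy', X)  # plain binary dump of the same array
print(f'np.save took {time.perf_counter() - t0:.2f} s')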

import numpy as np
import tensorflow as tf

X = np.random.randn(1000000, 65)

def write_tf_dataset(data: np.ndarray, path: str):
    with tf.io.TFRecordWriter(path=path) as writer:
        for record in data:
            feature = {'X': tf.train.Feature(float_list=tf.train.FloatList(value=record[:42])),
                       'Y': tf.train.Feature(float_list=tf.train.FloatList(value=record[42:64])),
                       'Z': tf.train.Feature(float_list=tf.train.FloatList(value=[record[64]]))}
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            serialized = example.SerializeToString()
            writer.write(serialized)

write_tf_dataset(X, 'X.tfrecord')

How can I increase the performance of write_tf_dataset? My actual X is 200x larger than in the snippet.

I am not the first one to complain about slow TFRecord performance. Based on this TensorFlow GitHub issue, I made a second version of the function:

import pickle

def write_tf_dataset(data: np.ndarray, path: str):
    with tf.io.TFRecordWriter(path=path) as writer:
        for record in data:
            feature = {
                'X': tf.io.serialize_tensor(record[:42]).numpy(),
                'Y': tf.io.serialize_tensor(record[42:64]).numpy(),
                'Z': tf.io.serialize_tensor(record[64]).numpy(),
            }
            serialized = pickle.dumps(feature)
            writer.write(serialized)

... but it performed even worse. Ideas?
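As an aside, pickle output can't be parsed back by tf.data. If the tf.io.serialize_tensor route is kept, the usual pattern is to wrap the serialized bytes in a BytesList feature so the file remains a valid stream of Examples. A sketch of that variant, not benchmarked here (write_tf_dataset_v3 is just an illustrative name):

import numpy as np
import tensorflow as tf

def write_tf_dataset_v3(data: np.ndarray, path: str):
    with tf.io.TFRecordWriter(path=path) as writer:
        for record in data:
            # Each slice is serialized to bytes and stored in a BytesList feature.
            feature = {
                'X': tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[tf.io.serialize_tensor(record[:42]).numpy()])),
                'Y': tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[tf.io.serialize_tensor(record[42:64]).numpy()])),
                'Z': tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[tf.io.serialize_tensor(record[64]).numpy()])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())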

Upvotes: 3

Views: 808

Answers (1)

myagues

Reputation: 96

A workaround is to use the multiprocessing package. You can have several processes serialize in parallel and write to a single TFRecord file, or have each process write its own file. I think multiple small TFRecords are the recommended approach over one big file, since it is faster to read from multiple sources:

import multiprocessing
import os

import numpy as np
import tensorflow as tf


def serialize_example(record):
    feature = {
        "X": tf.train.Feature(float_list=tf.train.FloatList(value=record[:42])),
        "Y": tf.train.Feature(float_list=tf.train.FloatList(value=record[42:64])),
        "Z": tf.train.Feature(float_list=tf.train.FloatList(value=[record[64]])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_tfrecord(tfrecord_path, records):
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for item in records:
            serialized = serialize_example(item)
            writer.write(serialized)


if __name__ == "__main__":
    np.random.seed(1234)
    data = np.random.randn(1000000, 65)

    # Option 1: write to a single file
    tfrecord_path = "/home/appuser/data/data.tfrecord"
    with multiprocessing.Pool(4) as pool:  # serialize rows in 4 worker processes
        with tf.io.TFRecordWriter(tfrecord_path) as writer:
            for example in pool.map(serialize_example, data):
                writer.write(example)

    # Option 2: write to multiple files
    procs = []
    n_shard = 4
    num_per_shard = int(np.ceil(len(data) / n_shard))
    for shard_id in range(n_shard):
        filename = f"data_{shard_id + 1:04d}_of_{n_shard:04d}.tfrecord"
        tfrecord_path = os.path.join("/home/appuser/data", filename)

        start_index = shard_id * num_per_shard
        end_index = min((shard_id + 1) * num_per_shard, len(data))

        args = (tfrecord_path, data[start_index:end_index])
        p = multiprocessing.Process(target=write_tfrecord, args=args)
        p.start()
        procs.append(p)

    for proc in procs:
        proc.join()
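If you go with the sharded files, the reading side can also fan out across shards. A sketch of a reader (the glob pattern and the 42/22/1 feature shapes follow the writer above):

import tensorflow as tf

# Interleave reads across all shards written by Option 2.
files = tf.data.Dataset.list_files("/home/appuser/data/data_*.tfrecord")
dataset = files.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

# Feature spec matches the 42 / 22 / 1 float split used in serialize_example.
feature_spec = {
    "X": tf.io.FixedLenFeature([42], tf.float32),
    "Y": tf.io.FixedLenFeature([22], tf.float32),
    "Z": tf.io.FixedLenFeature([1], tf.float32),
}
dataset = dataset.map(
    lambda example: tf.io.parse_single_example(example, feature_spec),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)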

Upvotes: 1
