Reputation: 415
I'm working with about 300MB of word embedding data (currently an .npz, but I'm willing to do the work to translate it into any format), and I'd like to know if there is a way to get that data into TensorFlow that doesn't involve initializing it in Python (i.e. initializing a tf.Variable from a numpy array).
My reason for wanting to avoid this is that doing so causes TensorFlow to dump my embeddings along with the graph definition when writing summaries. See https://github.com/tensorflow/tensorflow/issues/1444.
For my training data, I use the normal TensorFlow reader pipeline (TFRecordReader, filename queues, tf.train.shuffle_batch). That's very good at reading fixed-size batches of examples for a predefined number of epochs. What I have no idea how to do is read the entire contents of a file into a single tensor. I could solve this pretty easily by just reading a single batch that is the full size of my embeddings, but I'd like a more general solution that doesn't rely on knowing the number of records, just the individual record format.
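For reference, a minimal sketch of that kind of queue-based pipeline (TF 1.x-era API); the filename, feature spec, and batch parameters below are placeholders, not my actual setup:
import tensorflow as tf

# Hypothetical record format: 50 token ids plus a label per example.
filename_queue = tf.train.string_input_producer(["train.tfrecords"], num_epochs=10)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={"tokens": tf.FixedLenFeature([50], tf.int64),
              "label": tf.FixedLenFeature([], tf.int64)})
tokens_batch, labels_batch = tf.train.shuffle_batch(
    [features["tokens"], features["label"]],
    batch_size=32, capacity=2000, min_after_dequeue=1000)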
Upvotes: 4
Views: 1066
Reputation: 126154
The easiest way to achieve this would be to create a tf.Variable of the appropriate type and shape by initializing it from a tf.placeholder(), and then use the feed mechanism to pass in the value. As a result, the actual value will never appear in the graph itself.
Let's say your embedding matrix is 1000 x 100:
import tensorflow as tf

embedding_init = tf.placeholder(tf.float32, shape=[1000, 100])
embedding = tf.Variable(embedding_init)
You can then initialize the variable with the value from your .npz file:
import numpy

datafile = numpy.load("data.npz")
embedding_value = datafile["embedding"]

sess = tf.Session()
# The large matrix is passed via feed_dict, so it is never serialized
# into the graph definition.
sess.run(tf.initialize_all_variables(),
         feed_dict={embedding_init: embedding_value})
datafile.close()
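As a usage sketch (the token_ids placeholder here is just an assumed source of word indices, not something from your question), the initialized variable can then be used like any other embedding matrix, for example with tf.nn.embedding_lookup:
token_ids = tf.placeholder(tf.int32, shape=[None])
token_vectors = tf.nn.embedding_lookup(embedding, token_ids)  # shape [None, 100]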
Upvotes: 4