Reputation: 415
I'm working with about 300MB of word embedding data (currently an .npz, but I'm willing to do the work to translate it into any format), and I'd like to know if there is a way to get that data into TensorFlow that doesn't involve initializing it in Python (i.e. initializing a tf.Variable from a numpy array).
My reason for wanting to avoid this is that doing so causes TensorFlow to dump my embeddings along with the graph definition when writing summaries. See https://github.com/tensorflow/tensorflow/issues/1444.
For my training data, I use the normal TensorFlow reader pipeline (TFRecordReader, filename queues, tf.train.shuffle_batch). That's very good at reading fixed-size batches of examples for a predefined number of epochs. What I have no idea how to do is read the entire contents of a file into a single tensor. I could solve this pretty easily by just reading a single batch that is the full size of my embeddings, but I'd like a more general solution that doesn't rely on knowing the number of records, just the individual record format.
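For reference, a minimal sketch of that kind of queue-based pipeline (TF 1.x-era API); the filename, feature spec, and batch parameters below are placeholders, not my actual setup:
import tensorflow as tf

# Hypothetical record format: 50 token ids plus a label per example.
filename_queue = tf.train.string_input_producer(["train.tfrecords"], num_epochs=10)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={"tokens": tf.FixedLenFeature([50], tf.int64),
              "label": tf.FixedLenFeature([], tf.int64)})
tokens_batch, labels_batch = tf.train.shuffle_batch(
    [features["tokens"], features["label"]],
    batch_size=32, capacity=2000, min_after_dequeue=1000)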
Upvotes: 4
Views: 1066
Reputation: 126154
The easiest way to achieve this would be to create a tf.Variable of the appropriate type and shape by initializing it from a tf.placeholder(), and then use the feed mechanism to pass in the value. As a result, the actual value will never appear in the graph itself.
Let's say your embedding matrix is 1000 x 100:
import tensorflow as tf

embedding_init = tf.placeholder(tf.float32, shape=[1000, 100])
embedding = tf.Variable(embedding_init)
You can then initialize the variable with the value from your .npz file:
import numpy

datafile = numpy.load("data.npz")
embedding_value = datafile["embedding"]

sess = tf.Session()
# The large matrix is passed via feed_dict, so it is never serialized
# into the graph definition.
sess.run(tf.initialize_all_variables(),
         feed_dict={embedding_init: embedding_value})
datafile.close()
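As a usage sketch (the token_ids placeholder here is just an assumed source of word indices, not something from your question), the initialized variable can then be used like any other embedding matrix, for example with tf.nn.embedding_lookup:
token_ids = tf.placeholder(tf.int32, shape=[None])
token_vectors = tf.nn.embedding_lookup(embedding, token_ids)  # shape [None, 100]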
Upvotes: 4