University

Reputation: 11

ELMO embedding start session

I get an error when I apply the ELMo embedding to my data. I have 7,255 sentences.

embeddings = embed(
    sentences,
    signature="default",
    as_dict=True)['default']

# Start a session and run ELMo to return the embeddings in variable x
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  x = sess.run(embeddings)

The error is:

ResourceExhaustedError: OOM when allocating tensor with shape[36021075,50] and type int32 on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [[node module_apply_default/map/TensorArrayStack/TensorArrayGatherV3 (defined at C:\Users...\envs\tf_36\lib\site-packages\tensorflow_hub\native_module.py:547) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Upvotes: 1

Views: 205

Answers (1)

Jindřich

Reputation: 11220

ELMo is a large model. There are 2048-dimensional word embeddings and 4096-dimensional LSTM states in 2 layers and 2 directions. That alone is over 18k floats, roughly 72 kB per word (and there is much more: intermediate projections in the LSTMs, the character-level CNN for the word representations). You have 7,255 sentences; assuming an average sentence of 25 words, this gives over 12 GB of RAM, and that is a very conservative estimate.
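As a back-of-envelope check, the estimate above can be reproduced with a few lines of arithmetic (using the answer's numbers; the 25-words-per-sentence average is an assumption, not something measured on your data):

```python
# Rough memory estimate for storing all ELMo outputs at once,
# following the dimensions quoted in the answer.
WORD_EMB = 2048              # word-embedding dimension
LSTM_STATE = 4096            # LSTM state dimension
LAYERS, DIRECTIONS = 2, 2    # 2 layers, 2 directions
FLOAT_BYTES = 4              # float32

floats_per_word = WORD_EMB + LSTM_STATE * LAYERS * DIRECTIONS  # 18432 floats
bytes_per_word = floats_per_word * FLOAT_BYTES                 # ~72 kB per word

num_sentences = 7255
avg_words = 25               # assumed average sentence length
total_gib = num_sentences * avg_words * bytes_per_word / 2**30
print(round(total_gib, 1))   # ~12.5 GiB, just for the outputs themselves
```

The real peak is higher, since TensorFlow also holds intermediate tensors while computing the graph, which is why the OOM hits even if your machine has this much RAM.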

You need to split the sentences into batches and process the batches iteratively. There are many ways to do that, and I don't know what implementation you use or what exactly is in the variable sentences. But you can probably call tf.split on sentences and get a list of objects on which to call the session independently, or, if you use tf.data, you can use the batching provided by the Dataset API. You can also always split your data and use multiple input files.
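A minimal sketch of the batching idea, assuming sentences is a plain Python list of strings and embed is the TF-Hub module from the question (the chunking helper and the batch size of 64 are illustrative, not part of any API):

```python
def batches(items, batch_size):
    """Yield successive chunks of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage with the question's setup (sketch, not executed here):
#
# sentences_ph = tf.placeholder(tf.string, shape=[None])
# embeddings_op = embed(sentences_ph, signature="default",
#                       as_dict=True)["default"]
#
# all_embeddings = []
# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     sess.run(tf.tables_initializer())
#     for batch in batches(sentences, 64):
#         all_embeddings.append(
#             sess.run(embeddings_op, {sentences_ph: batch}))
```

Building the graph once with a placeholder and feeding each batch through the same op matters: calling embed(...) inside the loop would add new ops to the graph on every iteration. Note that all_embeddings can still exhaust RAM if you keep every batch; write batches to disk (e.g. with numpy.save) if so.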

Upvotes: 1

Related Questions