Reputation: 287
My setup has an NVIDIA P100 GPU. I am working on a Google BERT model to answer questions. I am using the SQuAD question-answering dataset, which gives me questions, and paragraphs from which the answers should be drawn, and my research indicates this architecture should be OK, but I keep getting OutOfMemory errors during training:
ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Below, please find a full program that uses someone else's implementation of Google's BERT algorithm inside my own model. Please let me know what I can do to fix my error. Thank you!
import json
import numpy as np
import pandas as pd
import os
assert os.path.isfile("train-v1.1.json"),"Non-existent file"
from tensorflow.python.client import device_lib
import tensorflow.compat.v1 as tf
#import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import re
regex = re.compile(r'\W+')
#Reading the files.
def readFile(filename):
    with open(filename) as file:
        fields = []
        JSON = json.loads(file.read())
        articles = []
        for article in JSON["data"]:
            articleTitle = article["title"]
            article_body = []
            for paragraph in article["paragraphs"]:
                paragraphContext = paragraph["context"]
                article_body.append(paragraphContext)
                for qas in paragraph["qas"]:
                    question = qas["question"]
                    answer = qas["answers"][0]
                    fields.append({"question":question,"answer_text":answer["text"],"answer_start":answer["answer_start"],"paragraph_context":paragraphContext,"article_title":articleTitle})
            article_body = "\n".join(article_body)
            article = {"title":articleTitle,"body":article_body}
            articles.append(article)
        fields = pd.DataFrame(fields)
        fields["question"] = fields["question"].str.replace(regex," ")
        assert not (fields["question"].str.contains("catalanswhat").any())
        fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
        fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
        assert not (fields["paragraph_context"].str.contains("catalanswhat").any())
        fields["article_title"] = fields["article_title"].str.replace("_"," ")
        assert not (fields["article_title"].str.contains("catalanswhat").any())
        return fields,JSON["data"]
trainingData,training_JSON = readFile("train-v1.1.json")
print("JSON dataset read.")
#Text preprocessing
## Converting text to skipgrams
print("Tokenizing sentences.")
strings = trainingData.drop("answer_start",axis=1)
strings = strings.values.flatten()
answer_start_train_one_hot = pd.get_dummies(trainingData["answer_start"])
# @title Keras-BERT Environment
import os
pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
# Use TF_Keras
os.environ["TF_KERAS"] = "1"
# @title Load Basic Model
import codecs
from keras_bert import load_trained_model_from_checkpoint
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
model = load_trained_model_from_checkpoint(config_path, checkpoint_path)
#@title Model Summary
model.summary()
#@title Create tokenization stuff.
from keras_bert import Tokenizer
tokenizer = Tokenizer(token_dict)
def tokenize(text,max_len):
    tokenizer.tokenize(text)
    return tokenizer.encode(first=text,max_len=max_len)
def tokenize_array(texts,max_len=512):
    indices = np.zeros((texts.shape[0],max_len))
    segments = np.zeros((texts.shape[0],max_len))
    for i in range(texts.shape[0]):
        tokens = tokenize(texts[i],max_len)
        indices[i] = tokens[0]
        segments[i] = tokens[1]
    #print(indices.shape)
    #print(segments.shape)
    return np.stack([segments,indices],axis=1)
#@ Tokenize inputs.
def X_Y(dataset,answer_start_one_hot,batch_size=10):
    questions = dataset["question"]
    contexts = dataset["paragraph_context"]
    questions_tokenized = tokenize_array(questions.values)
    contexts_tokenized = tokenize_array(contexts.values)
    X = np.stack([questions_tokenized,contexts_tokenized],axis=1)
    Y = answer_start_one_hot
    return X,Y
def X_Y_generator(dataset,answer_start_one_hot,batch_size=10):
    while True:
        try:
            batch_indices = np.random.choice(np.arange(0,dataset.shape[0]),size=batch_size)
            dataset_batch = dataset.iloc[batch_indices]
            X,Y = X_Y(dataset_batch,answer_start_one_hot.iloc[batch_indices])
            max_int = pd.concat((trainingData["answer_start"],devData["answer_start"])).max()
            yield (X,Y)
        except Exception as e:
            print("Unhandled exception in X_Y_generator: ",e)
            raise
# Imports needed for the layers, model and callback used below
from tensorflow.keras.layers import Input, Lambda, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
model.trainable = True
answers_network_checkpoint = ModelCheckpoint('answers_network-best.h5', verbose=1, monitor='val_loss',save_best_only=True, mode='auto')
input_layer = Input(shape=(2,2,512,))
print("input layer: ",input_layer.shape)
questions_input_layer = Lambda(lambda x: x[:,0])(input_layer)
context_input_layer = Lambda(lambda x: x[:,1])(input_layer)
print("questions input layer: ",questions_input_layer.shape)
print("context input layer: ",context_input_layer.shape)
questions_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(questions_input_layer)
print("questions indices layer: ",questions_indices_layer.shape)
questions_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(questions_input_layer)
print("questions segments layer: ",questions_segments_layer.shape)
context_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(context_input_layer)
context_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(context_input_layer)
questions_bert_layer = model([questions_indices_layer,questions_segments_layer])
print("Questions bert layer loaded.")
context_bert_layer = model([context_indices_layer,context_segments_layer])
print("Context bert layer loaded.")
questions_flattened = Flatten()(questions_bert_layer)
context_flattened = Flatten()(context_bert_layer)
combined = Concatenate()([questions_flattened,context_flattened])
#bert_dense_questions = Dense(256,activation="sigmoid")(questions_flattened)
#bert_dense_context = Dense(256,activation="sigmoid")(context_flattened)
answers_network_output = Dense(1604,activation="softmax")(combined)
#answers_network = Model(inputs=[input_layer],outputs=[questions_bert_layer,context_bert_layer])
answers_network = Model(inputs=[input_layer],outputs=[answers_network_output])
answers_network.summary()
answers_network.compile("adam","categorical_crossentropy",metrics=["accuracy"])
answers_network.fit_generator(
    X_Y_generator(
        trainingData,
        answer_start_train_one_hot,
        batch_size=10),
    steps_per_epoch=100,
    epochs=100,
    callbacks=[answers_network_checkpoint])
My vocabulary size is about 83,000 words. Any model with a "good" accuracy/F1 score is preferred, but I am also on a non-extensible deadline in 5 days.
EDIT:
Unfortunately, there was one thing I didn't mention: I am actually using CyberZHG's keras-bert module for preprocessing, and for the actual BERT model, so some optimizations may actually break the code. For example, I tried setting the default float value to float16, but this caused a compatibility error.
EDIT #2:
By request, here's the code for my full program:
Upvotes: 4
Views: 11736
Reputation: 10865
Check out the Out-of-memory issues section on their GitHub page.
Often this happens because the batch size or sequence length is too large to fit in GPU memory. The following are the maximum batch configurations for a 12 GB GPU, as listed in the link above:
System | Seq Length | Max Batch Size
------------ | ---------- | --------------
`BERT-Base` | 64 | 64
... | 128 | 32
... | 256 | 16
... | 320 | 14
... | 384 | 12
... | 512 | 6
`BERT-Large` | 64 | 12
... | 128 | 6
... | 256 | 2
... | 320 | 1
... | 384 | 0
... | 512 | 0
Update
I see what you're doing here. The tensor with shape [786432, 1604] that causes the error comes from the last layer, Dense(1604,activation="softmax")(combined): the first dimension 786432 = 768 * 1024 comes from concatenating the 768-d BERT features of two 512-token sequences, and the second dimension 1604, I suppose, is for all the possible locations or intervals of the predicted answer.
However, for sequence labeling tasks like SQuAD, people usually don't use such a big fully connected layer. Instead, you can apply the same weights at each position and then normalize the sequence outputs with a softmax. This reduces the number of parameters in the final layer from 768 * 1024 * 1604 to something like 768 * 2, where the output dimension 2 is for predicting the start and end positions of the answer.
There's an example in the BERT GitHub repo that shows how to do SQuAD with BERT-like models, and there's a section in the BERT paper describing this.
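To make that concrete, here is a minimal sketch (not the code from the BERT repo) of such a shared start/end head in tf.keras, assuming the BERT sequence output has shape [batch, 512, 768]; all names here are illustrative:

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model

seq_len, hidden = 512, 768
sequence_output = Input(shape=(seq_len, hidden))    # stand-in for the BERT sequence output

# One Dense(2) applied at every position: 768*2 weights instead of 768*1024*1604.
logits = Dense(2)(sequence_output)                  # [batch, 512, 2]
start_logits = Lambda(lambda x: x[..., 0])(logits)  # [batch, 512]
end_logits = Lambda(lambda x: x[..., 1])(logits)    # [batch, 512]

# Softmax over positions: a probability for each token being the start/end of the answer.
start_probs = Lambda(lambda x: tf.nn.softmax(x, axis=-1))(start_logits)
end_probs = Lambda(lambda x: tf.nn.softmax(x, axis=-1))(end_logits)

span_head = Model(inputs=sequence_output, outputs=[start_probs, end_probs])
span_head.summary()
```

The Dense(2) kernel holds only 768 * 2 weights, shared across all positions, which is what keeps this head small.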
Upvotes: 6
Reputation: 11333
Edit: I have edited my response in place rather than making an already long answer even longer.
Looking at the issue, it arises from the final layer in your model, and I was able to get it to work with the following fixes/changes.
ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
So, looking at the error, the problem is allocating an array of shape [786432, 1604]. A simple calculation shows that is roughly a 5 GB allocation (assuming float32); if it is float64, that goes to about 10 GB. Add the parameters coming from BERT and the other layers in the model and, voilà, you run out of memory.
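As a quick sanity check on those numbers (pure arithmetic on the shape from the error message):

```python
# Size of a [786432, 1604] kernel alone, ignoring every other tensor in the model.
rows, cols = 786432, 1604
print(rows * cols * 4 / 1e9)   # ~5.0 GB as float32
print(rows * cols * 8 / 1e9)   # ~10.1 GB as float64
```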
The issues
Looking at the code, all the layers in your answer network produce float64 outputs because you are specifying float64 for all your Lambda layers. So my first suggestion is:
tf.keras.backend.set_floatx('float16')
And as a precaution,
question_indices_layer = Input(shape=(256,), dtype='float16')
question_segments_layer = Input(shape=(256,), dtype='float16')
context_indices_layer = Input(shape=(256,), dtype='float16')
context_segments_layer = Input(shape=(256,), dtype='float16')
questions_bert_layer = model([question_indices_layer,question_segments_layer])
context_bert_layer = model([context_indices_layer,context_segments_layer])
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])
This keeps everything downstream of the BERT outputs in float16.
Squashing before the softmax layer
Another thing you can do is, instead of passing a massive [batch size, 512, 768] output to your dense layer, squash it using a smaller layer or some transformation. A few things you can try are:
- Add smaller Dense layers before feeding into the 1604-way softmax layer. This reduces the model parameters significantly.
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])
- Sum (or average) over the time dimension of the question output. Because you only care about understanding what the question is, it is fine to lose positional information from that output. You can do it the following way (K being tensorflow.keras.backend):
questions_flattened = Lambda(lambda x: K.sum(x, axis=1))(questions_bert_layer)
- Instead of Concatenate, try Add() so that you don't increase the dimensionality (see the sketch below).
You can try any of these, optionally in combination with the others in the list. But make sure you match the dimensions of questions_flattened and contexts_flattened when doing these in combination, as otherwise you'll get errors.
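For the last point, here is a minimal sketch of what that could look like, building on the questions_bert_layer and context_bert_layer tensors from your code (the width 64 is just an illustrative choice):

```python
from tensorflow.keras.layers import Add, Dense, Flatten

# Project both branches to the same width so Add() is valid, then add instead of concatenating.
questions_small = Dense(64, activation='relu')(Flatten()(questions_bert_layer))  # [batch, 64]
contexts_small = Dense(64, activation='relu')(Flatten()(context_bert_layer))     # [batch, 64]
combined = Add()([questions_small, contexts_small])                              # still [batch, 64]
```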
The next problem is that your input length is 512. I'm not sure how you arrived at that number, but I think you can do fine well below it. For example, you get the following statistics for question and paragraph_context lengths:
count 175198.000000
mean 11.217582
std 3.597345
min 1.000000
25% 9.000000
50% 11.000000
75% 13.000000
max 41.000000
Name: question, dtype: float64
count 175198.000000
mean 123.791653
std 50.541241
min 21.000000
25% 92.000000
50% 114.000000
75% 147.000000
max 678.000000
Name: paragraph_context, dtype: float64
You can get this information as,
pd.Series(trainingData["question"]).str.split(' ').str.len().describe()
As an example, when you pad your sequences using pad_sequences without specifying a maxlen, sentences are padded to the maximum length found in the corpus. For example, you'd end up with a 678-element-long paragraph context, while 75% of the data is under 150 words long. I'm not exactly sure how these values play into the length of 512, but I hope you get my point. From the looks of it, you can do fine with a length of about 150.
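For instance, if you do use pad_sequences anywhere in your preprocessing, capping the length is a one-argument change (150 is just the figure suggested above):

```python
from keras.preprocessing.sequence import pad_sequences

sequences = [[3, 7, 12], [5] * 678]    # toy token-id lists of very different lengths
# maxlen caps/pads every sequence to 150 instead of the corpus maximum (678 here).
padded = pad_sequences(sequences, maxlen=150, padding='post', truncating='post')
print(padded.shape)                     # (2, 150)
```

In your code, the equivalent knob is the max_len you pass to tokenize_array / tokenizer.encode.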
You can also reduce the vocabulary.
A good way of deciding its size is to count the unique words that appear more than n times in your corpus (n can be 10-25, or, better, do some further analysis and find an optimal value). For example, you can get vocabulary stats as follows:
counts = sorted([(k, v) for k, v in list(textTokenizer.word_counts.items())], key=lambda x: x[1])
This gives you (word, count) pairs sorted by frequency. You will see that around 37,000 words appear fewer than (roughly) 10 times, so you can set the vocabulary size of the tokenizer to something smaller:
textTokenizer = Tokenizer(num_words=50000, oov_token='unk')
But keep in mind that word_index still contains all the words, so you need to make sure you remove these rare words when you pass it as token_dict.
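As a rough sketch of that filtering step, assuming textTokenizer has already been fit on your texts (the threshold 10 is just the example value from above):

```python
# Keep only the words that appear at least `min_count` times in the fitted Keras tokenizer.
min_count = 10
frequent_words = [w for w, c in textTokenizer.word_counts.items() if c >= min_count]
reduced_token_dict = {w: i for i, w in enumerate(frequent_words)}
print(len(reduced_token_dict))
```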
You seem to be setting batch_size=10, which should be fine. But to get better results (and hopefully with enough memory freed up once you apply the above suggestions), go for a higher batch size like 32 or 64, which will improve performance.
Upvotes: 9
Reputation: 23556
Your problem is when you create this Dense() layer:
combined = Concatenate()([questions_flattened,context_flattened])
answers_network_output = Dense(1604,activation="softmax")(combined)
Concatenate() gives you a huge layer, and when you connect that to Dense(1604, ...) you get a (786432, 1604) tensor, which is about 1.2G values (weights plus biases, both floats) and will easily overflow your GPU memory.
To check whether my assumption is correct, try creating the layer as:
answers_network_output = Dense(1604,activation="softmax")(something_smaller)
where something_smaller is a layer of smaller size than the concatenated one. Once you confirm this is your problem, you'll find a way to use less memory than you do now.
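For instance, a small bottleneck between the concatenation and the big softmax (64 is an arbitrary choice) is enough to test this, reusing the combined tensor from your code:

```python
from tensorflow.keras.layers import Dense

# Shrink the 786432-wide concatenation to 64 features before the 1604-way softmax.
something_smaller = Dense(64, activation='relu')(combined)
answers_network_output = Dense(1604, activation="softmax")(something_smaller)
```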
Upvotes: 0