Jeeva Bharathi

Reputation: 554

How can I load the context once and predict answers dynamically per question in a BERT neural network model?

I have created a workflow that takes a question from the user and makes a prediction for it. I am using a BERT neural network model for prediction, trained on SQuAD 2.0 using a TPU. I load a paragraph or two into the context in the following JSON structure:

{
  "data": [
    {
      "paragraphs": [
        {
          "qas": [
            {
              "question": "question",
              "id": "65432sd54654dadaad"
            }
          ],
          "context": "paragraph"
        }
      ]
    }
  ]
}

and send this to predict the answer, and it takes about a minute per question. These are the things I've noticed: the context and question are first converted to 1s and 0s (true/false values), and only then does the prediction start. The prediction itself takes about 20 seconds or less.

If I try to add 5 MB of text to the context, it takes two full hours to convert to 1s and 0s (true/false values) before predicting the answer.

Is it possible to load the context once and predict the answer dynamically with regard to the question? I use run_squad.py. These are the flags I used:

python run_squad.py \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/model.ckpt \
  --do_train=False \
  --train_file=$SQUAD_DIR/train-v2.0.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://some_bucket/squad_large/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME \
  --version_2_with_negative=True
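
For reference, a minimal sketch of how such a predict file can be generated programmatically, in the SQuAD 2.0 schema shown above. The question text, id, and file name here are made-up placeholders; run_squad.py only cares about the schema and the path passed via --predict_file:

import json

# Hypothetical example values; substitute your own question, id, and context.
predict_data = {
    "data": [
        {
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "What is the capital of France?",
                            "id": "q-0001"
                        }
                    ],
                    "context": "Paris is the capital and largest city of France."
                }
            ]
        }
    ]
}

# run_squad.py reads this file via the --predict_file flag.
with open("predict.json", "w") as f:
    json.dump(predict_data, f)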

Upvotes: 0

Views: 320

Answers (1)

tomkot

Reputation: 956

To my understanding, this is not possible. When the paragraph is too long to fit into a single input sequence, BERT uses a sliding-window approach. One question-paragraph pair may therefore give rise to many inputs to the BERT model: each input consists of the query concatenated with one sliding window (a subsequence of the paragraph). An embedding is computed for this joint input, followed by a few layers specific to SQuAD. Importantly, it is one BERT embedding for the query and the paragraph subsequence together. This means that, technically, computing the embedding of the context alone, one time, does not work here.
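As a simplified illustration of the sliding-window input construction (whitespace tokenization and the window arithmetic here are a toy stand-in for run_squad.py's WordPiece-based feature conversion; max_seq_length=384 and doc_stride=128 mirror the flags above):

# Toy sketch: how one question + long paragraph becomes several
# overlapping BERT inputs of the form [CLS] question [SEP] window [SEP].
def make_windows(question_tokens, paragraph_tokens,
                 max_seq_length=384, doc_stride=128):
    # Room left for paragraph tokens after [CLS], question, and two [SEP]s.
    max_window = max_seq_length - len(question_tokens) - 3
    inputs = []
    start = 0
    while start < len(paragraph_tokens):
        window = paragraph_tokens[start:start + max_window]
        inputs.append(["[CLS]"] + question_tokens + ["[SEP]"]
                      + window + ["[SEP]"])
        if start + max_window >= len(paragraph_tokens):
            break
        start += doc_stride  # consecutive windows overlap
    return inputs

question = "What is the capital of France ?".split()
paragraph = ("Paris is the capital of France . " * 200).split()
print(len(make_windows(question, paragraph)))  # several overlapping inputs

Every window is run through the full model together with the question, which is why a very large context multiplies the work per question.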

Conceptually, the attention layers of the BERT model can decide which tokens in the paragraph to attend to based on the query, and vice versa. This gives the model considerable power, rather than forcing it to decide where in the paragraph to attend before knowing the query.
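A toy numeric illustration of why this joint encoding matters (plain self-attention over the concatenated sequence; the shapes and random values are arbitrary, not BERT's real weights):

import numpy as np

def self_attention(x):
    # Scaled dot-product self-attention over the whole sequence.
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
paragraph = rng.normal(size=(6, 4))  # 6 paragraph token vectors
query_a = rng.normal(size=(2, 4))    # tokens of question A
query_b = rng.normal(size=(2, 4))    # tokens of question B

out_a = self_attention(np.vstack([query_a, paragraph]))
out_b = self_attention(np.vstack([query_b, paragraph]))

# The representations of the *paragraph* tokens differ depending on
# which question they were encoded with, so they cannot be precomputed.
print(np.allclose(out_a[2:], out_b[2:]))  # False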

Upvotes: 1
