Reputation: 11
I am trying to train a RoBERTa model from scratch on custom data in Google Colab. I am able to get the training done using a TPU and I see a significant reduction in training time (it takes hours on CPU but finishes in a couple of minutes on TPU). But model prediction is taking more time than training: for data 1/4th the size of the training data, the prediction time is more than double, implying training is roughly 10x faster than prediction. I was expecting prediction to be faster. Am I missing something?
Given below is a simplified version of the code. I tried changing the input format (NumPy, tensors), the batch size, etc., but it didn't help. The fact that training is faster seems to indicate the TPU setup is fine, and I am using the same dataset to test prediction, so there is nothing wrong with the data either. So I am wondering what is causing the slowdown in prediction.
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM
from transformers import TFRobertaForMaskedLM
import tensorflow as tf
/usr/local/lib/python3.11/dist-packages/torch_xla/__init__.py:253: UserWarning: tensorflow can conflict with torch-xla. Prefer tensorflow-cpu when using PyTorch/XLA. To silence this warning, pip uninstall -y tensorflow && pip install tensorflow-cpu. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
  warnings.warn(
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("replicas",strategy.num_replicas_in_sync)
batch_size = 8 * strategy.num_replicas_in_sync
print(strategy)
replicas 8
<tensorflow.python.distribute.tpu_strategy.TPUStrategyV2 object at 0x7bdcd1c6d6d0>
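For completeness, a quick sanity check that could go here (not part of my original notebook; just a minimal check assuming the resolver/strategy setup above) to confirm the TPU cores are actually visible to TensorFlow:

print(tf.config.list_logical_devices("TPU"))  # expect 8 TPU logical devices, matching num_replicas_in_sync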
config = RobertaConfig(
    vocab_size=10_000,
    max_position_embeddings=64,
    num_attention_heads=6,
    num_hidden_layers=3,
    type_vocab_size=1,
    hidden_size=300,
    intermediate_size=600
)
with strategy.scope():
    model = TFRobertaForMaskedLM(config=config)  # TensorFlow RoBERTa model
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=5e-05)
    )
Read tokenized and masked data
import tensorflow as tf

def decode_fn(sample):
    features = {
        "input_ids": tf.io.FixedLenFeature((64,), dtype=tf.int64),
        "attention_mask": tf.io.FixedLenFeature((64,), dtype=tf.int64),
        "labels": tf.io.FixedLenFeature((64,), dtype=tf.int64)
    }
    return tf.io.parse_example(sample, features)
tf_dataset = tf.data.TFRecordDataset(["dataset.tfrecords"])
tf_dataset = tf_dataset.map(decode_fn)
tf_dataset = tf_dataset.batch(batch_size, drop_remainder=True)
tf_dataset = tf_dataset.apply(
    tf.data.experimental.assert_cardinality(263317 // 64))
predict_check_dataset = tf_dataset.take(1000)
print("Number of training Batch:",len(list(tf_dataset)))
print("Number of prediction Batch:",len(list(predict_check_dataset)))
Number of training Batch: 4114
Number of prediction Batch: 1000
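For context, dataset.tfrecords contains pre-tokenized, pre-masked sequences of length 64. A simplified sketch of how such a record can be written (the helper name and dummy values here are only illustrative; my actual preprocessing is omitted):

import tensorflow as tf

def serialize_example(input_ids, attention_mask, labels):
    # Each feature is a fixed-length list of 64 int64 values, matching the
    # tf.io.FixedLenFeature((64,), dtype=tf.int64) spec used in decode_fn.
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "attention_mask": tf.train.Feature(int64_list=tf.train.Int64List(value=attention_mask)),
        "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

with tf.io.TFRecordWriter("dataset.tfrecords") as writer:
    # Dummy record: 64 token ids, full attention mask, labels equal to the inputs.
    dummy_ids = list(range(64))
    writer.write(serialize_example(dummy_ids, [1] * 64, dummy_ids))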
train_log = model.fit(tf_dataset)
4114/4114 [==============================] - 94s 23ms/step - loss: 1.5873
for batch in predict_check_dataset:
    prediction = model(batch['input_ids'], attention_mask=batch['attention_mask'], training=False)
The prediction loop takes about 240 seconds even though it covers only about a quarter of the data, while training over the full dataset took 94 seconds: roughly 94 s / 4114 batches ≈ 23 ms per batch for training versus 240 s / 1000 batches ≈ 240 ms per batch for prediction.
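For comparison, the same check could also be run through Keras's compiled predict() path instead of calling the model eagerly batch by batch. A minimal sketch (I have not verified whether it closes the timing gap):

# Same batches, dispatched through Keras's predict() path, which, like fit(),
# wraps the forward pass in a compiled prediction step under the active strategy.
predictions = model.predict(predict_check_dataset, verbose=1)

Alternatively, wrapping the per-batch forward pass in a tf.function would at least avoid executing it op by op in eager mode.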
Upvotes: 1
Views: 22