O.Sahin

Reputation: 21

BERT Pre-Training MLM + NSP

I want to pre-train BERT on the MLM + NSP tasks. When I run the code below, it throws the following error:

RuntimeError: The size of tensor a (882) must match the size of tensor b (512) at non-singleton dimension 1
  1%|▊ | 3/561 [00:02<06:13, 1.49it/s]

It looks like a truncation problem, but why? I only used the library's own classes. If someone could enlighten me, I would be happy. Thanks in advance.

The code I run:

from transformers import BertTokenizer
from transformers import BertConfig, BertForPreTraining
from transformers import TextDatasetForNextSentencePrediction
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

TOKENIZER_PATH = "hukuk_tokenizer"
MAX_LEN = 512
BLOCK_SIZE = 128
DATA_PATH = "data/toy_sentences_v3.removed_long_sent.txt"
OUTPUT_DIR = "/home/osahin/bert_yoktez/results/"
config = BertConfig()

if TOKENIZER_PATH == "hukuk_tokenizer":

        config.update({"vocab_size":30000})


print("config: ",config)

tokenizer = BertTokenizer.from_pretrained(TOKENIZER_PATH)
tokenizer.model_max_length= MAX_LEN
print("Tokenizer: ",tokenizer)

model = BertForPreTraining(config)

dataset= TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path=DATA_PATH,
    block_size = BLOCK_SIZE
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability= 0.15
)

training_args = TrainingArguments(
    output_dir= OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size= 32,
    save_steps=1000,
    save_on_each_node=True,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

NOTE: For the NSP task, the input file was prepared with one sentence per line.
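(For reference, TextDatasetForNextSentencePrediction expects one sentence per line with a blank line between documents. The lines below are only illustrative placeholders, not the actual contents of data/toy_sentences_v3.removed_long_sent.txt:)

First sentence of document one.
Second sentence of document one.

First sentence of document two.
Second sentence of document two.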

Upvotes: 0

Views: 1358

Answers (2)

rohanneps

Reputation: 101

The error occurs because the number of tokens produced by tokenization exceeds the maximum token capacity (512) of the model.

A manual truncation step can be injected into the pipeline once the TextDatasetForNextSentencePrediction object has been created:


import torch

# id 102 is [SEP] in the standard BERT vocabulary; with a custom tokenizer,
# tokenizer.sep_token_id is the safer choice
ending_sep_token_tensor = torch.tensor([102])

for sample in dataset.examples:
    if len(sample['input_ids']) > MAX_LEN:
        # keep the first MAX_LEN - 1 tokens and re-append the closing [SEP]
        sample['input_ids'] = torch.cat((sample['input_ids'][:MAX_LEN-1], ending_sep_token_tensor), 0)
        sample['token_type_ids'] = sample['token_type_ids'][:MAX_LEN]

Upvotes: 0

David Dale

Reputation: 11434

The error The size of tensor a (882) must match the size of tensor b (512) at non-singleton dimension 1 most probably means that the model supports texts of at most 512 tokens, but you are passing it a text of 882 tokens. To get around this, enable truncation somewhere in your pipeline (most probably at the moment of text tokenization, i.e. within TextDatasetForNextSentencePrediction, or immediately after its creation).
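For illustration only, one way the "somewhere in your pipeline" option could look is a thin wrapper around the collator that hard-truncates every example to 512 tokens before batching. This is a sketch, not code from the answer; it assumes the tokenizer, MAX_LEN and data_collator variables from the question's script:

import torch

# Sketch only: cut each NSP example down to the model's 512-token window
# before it reaches DataCollatorForLanguageModeling.
def truncate_example(example):
    if len(example["input_ids"]) > MAX_LEN:
        example = dict(example)  # avoid mutating the cached dataset example
        sep = torch.tensor([tokenizer.sep_token_id])
        # keep the first MAX_LEN - 1 tokens and restore the closing [SEP]
        example["input_ids"] = torch.cat((example["input_ids"][:MAX_LEN - 1], sep))
        example["token_type_ids"] = example["token_type_ids"][:MAX_LEN]
    return example

def truncating_collator(features):
    return data_collator([truncate_example(f) for f in features])

Then pass data_collator=truncating_collator to the Trainer instead of data_collator; everything else in the script stays the same.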

Upvotes: 1
