Mr. Engineer

Reputation: 375

ValueError when pre-training BERT model using Trainer API

I'm trying to fine-tune/pre-train an existing BERT model for sentiment analysis using the Trainer API from the transformers library. My training dataset looks like this:

Text                             Sentiment
This was good place                  1
This was bad place                   0

My goal is to be able to classify sentiments as positive/negative. And here is my code:

from datasets import load_dataset
from datasets import load_dataset_builder
import datasets
import transformers
from transformers import TrainingArguments
from transformers import Trainer

dataset = load_dataset('csv', data_files='my_data.csv', sep=';')
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased")
model = transformers.BertForMaskedLM.from_pretrained("bert-base-cased") 
print(dataset)
def tokenize_function(examples):
    return tokenizer(examples["Text"], examples["Sentiment"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=tokenized_datasets
)
trainer.train()

This throws error message:

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

What am I doing wrong? Any advice is highly appreciated.

Upvotes: 1

Views: 7219

Answers (1)

g_dzt

Reputation: 1478

There are several points here that you need to pay attention to in order to get your code working.

First of all, you are working on a sequence classification task, specifically binary classification, so you need to instantiate your model accordingly:

# replace this:
# model = transformers.BertForMaskedLM.from_pretrained("bert-base-cased")
# by this:
model = transformers.BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
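
If you want to convince yourself of the difference, here is a quick sanity check (just a sketch, assuming the tokenizer and model above are already loaded): the classification head produces one logit per label and computes a loss when labels are passed.

import torch

# illustrative check only: run a single example through the classification model
enc = tokenizer("This was good place", return_tensors="pt")
out = model(**enc, labels=torch.tensor([1]))
print(out.logits.shape)  # torch.Size([1, 2]) -> one logit per class
print(out.loss)          # cross-entropy loss against the provided label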

You shouldn't provide the labels (examples["Sentiment"]) to the tokenizer, as they don't need to be tokenized:

def tokenize_function(examples):
    # use only the text as input
    # use padding to standardize sequence length
    return tokenizer(examples["Text"], truncation=True, padding='max_length')

Speaking of labels, your trainer will expect them to be in a column named 'label', so you have to rename your 'Sentiment' column accordingly. Note that rename_column doesn't operate in-place as you might expect; it returns a new dataset that you have to capture.

# for example, after you tokenized the dataset:
tokenized_datasets = tokenized_datasets.rename_column('Sentiment', 'label')
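
You can quickly verify that the renaming worked by inspecting the columns of the train split; it should now contain 'label' alongside the tokenizer outputs:

# sanity check: expect something like ['Text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
print(tokenized_datasets['train'].column_names)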

Finally, you need to specify which split of the dataset you actually want to use for training. Here, since you did not split the dataset, it contains only one: 'train'.

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_datasets['train'] # here
)
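
Putting all of the above together, a minimal end-to-end sketch (assuming your my_data.csv with Text and Sentiment columns separated by ';', as in the question) would look roughly like this:

import transformers
from datasets import load_dataset
from transformers import TrainingArguments, Trainer

dataset = load_dataset('csv', data_files='my_data.csv', sep=';')
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased")
# sequence classification head with 2 labels instead of the masked-LM head
model = transformers.BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

def tokenize_function(examples):
    # tokenize only the text; labels are not tokenized
    return tokenizer(examples["Text"], truncation=True, padding='max_length')

tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Trainer expects the labels in a column named 'label'
tokenized_datasets = tokenized_datasets.rename_column('Sentiment', 'label')

training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],  # the only split here
)
trainer.train()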

That should make your code work, but it doesn't mean you'll get any interesting results. As you're interested in working with transformers, I strongly recommend you have a look at the series of notebooks by Hugging Face.

Upvotes: 1
