Reputation: 89
I am trying to train a transformer (Salesforce codet5-small) using the Hugging Face Trainer API on a Hugging Face dataset (namely, "eth_py150_open"). However, I'm running into a number of issues.
Here is the relevant code snippet:
import torch
import transformers
from datasets import load_dataset_builder
from datasets import load_dataset
corpus = load_dataset("eth_py150_open", split='train')

training_args = transformers.TrainingArguments( # general training arguments
    per_device_train_batch_size = 8,
    warmup_steps = 0,
    weight_decay = 0.01,
    learning_rate = 1e-4,
    num_train_epochs = 12,
    output_dir = './runs/run2/output/',
    logging_dir = './runs/run2/logging/',
    logging_steps = 50,
    save_steps = 10000,
    remove_unused_columns = False,
)

model = transformers.T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small').cuda()

trainer = transformers.Trainer(
    model = model,
    args = training_args,
    train_dataset = corpus,
)
However, when running trainer.train(), I get the following error:
***** Running training *****
Num examples = 74749
Num Epochs = 12
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 112128
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <module>
----> 1 trainer.train()
3 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in _prepare_inputs(self, inputs)
2414 if len(inputs) == 0:
2415 raise ValueError(
-> 2416 "The batch received was empty, your model won't be able to train on it. Double-check that your "
2417 f"training dataset contains keys expected by the model: {','.join(self._signature_columns)}."
2418 )
TypeError: can only join an iterable
I have tried converting corpus to a torch Dataset object, but can't seem to figure out how to do this. I'd really appreciate any help!
Upvotes: 1
Views: 2508
Reputation: 96
You need to tokenize the dataset before you can pass it to the model. Below I have added a preprocess() function to do the tokenization. You'll also need a data_collator to collate the tokenized sequences. Since T5 is a seq2seq model, I'm guessing you are trying to generate the license string, so I have replaced Trainer with Seq2SeqTrainer. (Although I think it would be better if you treated this as a sequence classification task; there's a rough sketch of that at the end of this answer.) Here's your updated script:
import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer

corpus = load_dataset("eth_py150_open", split='train')

model = transformers.T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small').cuda()
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codet5-small')

def preprocess(examples):
    # tokenize the file paths as the model inputs
    model_inputs = tokenizer(examples['filepath'], truncation=True)
    # tokenize the license strings as the targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['license'], truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# drop the raw string columns so only tokenized features reach the collator
tokenized_dataset = corpus.map(preprocess, batched=True, remove_columns=corpus.column_names)
training_args = transformers.Seq2SeqTrainingArguments( # general training arguments
    per_device_train_batch_size = 8,
    warmup_steps = 0,
    weight_decay = 0.01,
    learning_rate = 1e-4,
    num_train_epochs = 12,
    output_dir = './runs/run2/output/',
    logging_dir = './runs/run2/logging/',
    logging_steps = 50,
    save_steps = 10000,
    remove_unused_columns = False,
)
data_collator = transformers.DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = transformers.Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset,
    data_collator = data_collator,
)
Upvotes: 0