Reputation: 81
I'm attempting to fine-tune GPT-J using the Hugging Face Trainer and failing miserably. I followed the example that fine-tunes BERT, but of course the GPT-J model isn't exactly like the BERT model.
The error indicates that the model isn't producing a loss, which is great, except that I have no idea how to make it generate a loss or how to change what the trainer is expecting.
I'm using Transformers 4.22.2. I would like to get this working on a CPU before I try anything on Paperspace with a GPU. An initial attempt there on a GPU, with slightly different code to use CUDA, hit the same error.
I suspect that my approach is entirely wrong. I found a very old example of fine-tuning GPT-J using 8-bit quantization, but even that repository says it is deprecated.
I'm unsure whether my mistake is in reusing the compute_metrics() from the BERT example, or in the labels I provide in the config (I've tried different permutations), or something else entirely. Any advice would be appreciated.
I understand what a loss function is, but I don't know how it is supposed to be configured in this case.
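The closest I've gotten on my own: a plain forward pass only seems to return a loss when labels are passed explicitly. The sketch below uses sshleifer/tiny-gpt2 as a quick stand-in for GPT-J, purely to illustrate; I assume GPT-J behaves the same way. What I don't see is how to make the Trainer supply those labels.
from transformers import AutoModelForCausalLM, AutoTokenizer

# tiny stand-in model so the check runs in seconds; assuming GPT-J behaves the same way
tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
lm = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

enc = tok("hello world", return_tensors="pt")
print(lm(**enc).loss)                           # None: no labels, so no loss
print(lm(**enc, labels=enc["input_ids"]).loss)  # a scalar loss tensor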
My Code:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
from transformers import GPTJForCausalLM, AutoTokenizer
from datasets import load_dataset
import time
import torch
import os
import numpy as np
import evaluate
import sklearn
start = time.time()
GPTJ_FINE_TUNED_FILE = "./fine_tuned_models/gpt-j-6B"
print("Loading model")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", low_cpu_mem_usage=True)
model.config.pad_token_id = model.config.eos_token_id
print("Loading tokenizer")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token
print("Loading dataset")
current_dataset = load_dataset("wikitext", 'wikitext-103-v1')
current_dataset['train'] = current_dataset['train'].select(range(1200))
def tokenize_function(examples):
    current_tokenizer_result = tokenizer(examples["text"], padding="max_length", truncation=True)
    return current_tokenizer_result
print("Splitting and tokenizing dataset")
tokenized_datasets = current_dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].select(range(100))
print("Preparing training arguments")
training_args = TrainingArguments(
    output_dir=GPTJ_FINE_TUNED_FILE,
    report_to='all',
    logging_dir='./logs',
    per_device_train_batch_size=1,
    label_names=['input_ids', 'attention_mask'],  # 'logits', 'past_key_values'
    num_train_epochs=1,
    no_cuda=True,
)
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
)
print("Starting training")
trainer.train()
print(f"Finished fine-tuning in {time.time() - start}")
Running this produces the following error and stack trace:
File "xxx\ft_v3.py", line 66, in <module>
File "xxx\venv\lib\site-packages\transformers\trainer.py", line 1521, in train
return inner_training_loop(
File "xxx\venv\lib\site-packages\transformers\trainer.py", line 1763, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "xxx\venv\lib\site-packages\transformers\trainer.py", line 2499, in training_step
loss = self.compute_loss(model, inputs)
File "xxx\venv\lib\site-packages\transformers\trainer.py", line 2544, in compute_loss
raise ValueError(
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask.
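One sanity check I added on my side (not part of the trace above): the tokenized split has no labels column at all, which I assume is why nothing label-like ever reaches the model.
print(small_train_dataset.column_names)
# e.g. ['text', 'input_ids', 'attention_mask'] -- no 'labels'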
Upvotes: 4
Views: 3683
Reputation: 81
I found what appears to work, though now I'm running low on memory and working through ways of handling that.
The data_collator parameter seems to take care of the exact issue I was having:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,   # a small held-out split, built the same way as small_train_dataset
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
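As far as I can tell, with mlm=False the collator simply copies input_ids into a labels key (replacing padding with -100 so it is ignored by the loss), which is exactly what the model needed in order to return a loss. A rough hand-rolled equivalent, just to show the idea (causal_lm_collate is my own name, not a library function, and it assumes the features only contain input_ids and attention_mask):
def causal_lm_collate(features, tokenizer):
    # pad the batch, then reuse input_ids as labels, masking padding out of the loss
    batch = tokenizer.pad(features, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100  # -100 positions are ignored by the loss
    batch["labels"] = labels
    return batch
One caveat in this setup: because pad_token was set to eos_token, real end-of-text tokens get masked as well; as far as I can tell the built-in collator behaves the same way here.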
Upvotes: 1