Stimmot

Reputation: 1269

"ValueError: You have to specify either input_ids or inputs_embeds" when training AutoModelWithLMHead Model (GPT-2)

I want to fine-tune the AutoModelWithLMHead model from this repository, which is a German GPT-2 model. I have followed the tutorials for pre-processing and fine-tuning and have preprocessed a bunch of text passages for the fine-tuning, but when training begins, I receive the following error:

File "GPT\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "GPT\lib\site-packages\transformers\models\gpt2\modeling_gpt2.py", line 774, in forward
    raise ValueError("You have to specify either input_ids or inputs_embeds")
ValueError: You have to specify either input_ids or inputs_embeds

Here is my code for reference:

from transformers import AutoTokenizer, AutoModelWithLMHead, Trainer, TrainingArguments
from datasets import load_metric
import numpy

# Load data
with open("Fine-Tuning Dataset/train.txt", "r", encoding="utf-8") as train_file:
    train_data = train_file.read().split("--")

with open("Fine-Tuning Dataset/test.txt", "r", encoding="utf-8") as test_file:
    test_data = test_file.read().split("--")

# Load pre-trained tokenizer and prepare input
tokenizer = AutoTokenizer.from_pretrained('dbmdz/german-gpt2')

tokenizer.pad_token = tokenizer.eos_token
train_input = tokenizer(train_data, padding="longest")
test_input = tokenizer(test_data, padding="longest")

# Define model

model = AutoModelWithLMHead.from_pretrained("dbmdz/german-gpt2")
training_args = TrainingArguments("test_trainer")


# Evaluation

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = numpy.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_input,
    eval_dataset=test_input,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()

Does anyone know the reason for this? Any help is welcome!

Upvotes: 0

Views: 3853

Answers (2)

Koren Lazar

Reputation: 1

The train_dataset and eval_dataset arguments expect an object of type torch.utils.data.Dataset or torch.utils.data.IterableDataset. You can, for example, load the data with Hugging Face's datasets library and process it as follows:

import datasets

data = datasets.load_dataset("text", data_files={"train": "Fine-Tuning Dataset/train.txt", "test": "Fine-Tuning Dataset/test.txt"})

def tokenize_function(element):
    # With batched=True, `element` is a dict of columns; the raw lines are under "text"
    return tokenizer(element["text"], padding="longest")

tokenized_data = data.map(tokenize_function, batched=True)

Now, the following should work (along with the other code you attached):

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test'],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
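
One thing to watch out for: the tokenized splits above only contain input_ids and attention_mask, and a causal LM also needs labels to compute a loss. If training complains about missing labels, a data collator can build them from the input_ids. A minimal sketch, assuming the tokenizer, model, training_args and tokenized_data defined above:

from transformers import DataCollatorForLanguageModeling

# mlm=False selects causal-LM collation: labels are a copy of input_ids
# (GPT-2 shifts them internally when computing the loss)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)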

Upvotes: 0

Stimmot

Reputation: 1269

I didn't find a concrete answer to this question, but I found a workaround. For anyone looking for examples of how to fine-tune the GPT models from Hugging Face, have a look at this repo. It lists several examples of how to fine-tune different Transformer models, complemented by documented code. I used the run_clm.py script and it achieved what I wanted.
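
A typical invocation looks roughly like the following; the exact flags can differ between versions of the example script, so treat this as a sketch and check the script's --help or README rather than copying it verbatim:

python run_clm.py \
    --model_name_or_path dbmdz/german-gpt2 \
    --train_file "Fine-Tuning Dataset/train.txt" \
    --validation_file "Fine-Tuning Dataset/test.txt" \
    --do_train \
    --do_eval \
    --output_dir test_trainer

The script handles tokenization, grouping the text into fixed-length blocks, and setting up the Trainer for you, which sidesteps the dataset-formatting issue from the question.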

Upvotes: 0
