I am trying to fine-tune a LLaVA-OneVision model (https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf), but I cannot figure out how to prepare the labels to pass to the model. The collate function, the model's forward function, and the training step are given below.
import torch

def collate_fn(self, batch):
    images = []
    texts = []
    answers = []
    for example in batch:
        question, answer, rgb_image_np = example
        images.append(rgb_image_np)
        answers.append(answer)
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},  # Add the question text
                    {"type": "image"},  # Add the image placeholder
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": answer},  # Add the assistant's answer
                ],
            },
        ]
        text_prompt = self.processor.apply_chat_template(conversation)
        texts.append(text_prompt)
    # Prepare inputs (image + text)
    model_inputs = self.processor(
        images=images,
        text=texts,
        return_tensors="pt",
        padding=True
    ).to(torch.float16)
    # Prepare labels
    labels = ???  # How to prepare the labels?
    # Add labels to the batch dictionary
    return {
        "input_ids": model_inputs["input_ids"],
        "labels": labels
    }
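For context, I build the batches with a standard PyTorch DataLoader; the dataset object and batch size below are placeholders, and my dataset yields (question, answer, rgb_image_np) tuples:

from torch.utils.data import DataLoader

# Placeholder setup: `train_dataset` yields (question, answer, rgb_image_np) tuples,
# and `data_module` is the object whose collate_fn is shown above
train_loader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=data_module.collate_fn,
)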
import pytorch_lightning as pl
from transformers import LlavaOnevisionForConditionalGeneration

class LlavaOnevisionModule(pl.LightningModule):
    def __init__(self, model_name, processor, learning_rate=2e-5):
        super().__init__()
        self.model_name = model_name
        self.learning_rate = learning_rate
        self.processor = processor
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            model_name,
            low_cpu_mem_usage=True,
            # use_flash_attention_2=True,
            # torch_dtype=torch.float16
        )
        self.config = self.model.config
        self.pad_token_id = (self.processor.tokenizer.eos_token_id
                             if self.processor.tokenizer.pad_token_id is None
                             else self.processor.tokenizer.pad_token_id)

    def forward(self, input_ids, labels):
        # Build the input dict the model expects, moving tensors to the correct device
        inputs = {
            'input_ids': input_ids.to(self.device),
            'labels': labels.to(self.device),
        }
        # Pass the inputs through the model (which expects input_ids and labels)
        outputs = self.model(**inputs)
        return outputs

    def training_step(self, batch, batch_idx):
        # Unpack the batch
        input_ids = batch['input_ids']
        labels = batch['labels']
        # Forward pass; the model computes the loss from the labels
        outputs = self(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        # Log training loss
        self.log('train_loss', loss)
        return loss
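For completeness, I then train the module with a standard Lightning Trainer, roughly like this (the trainer arguments and the processor variable are placeholders, continuing from the code above):

module = LlavaOnevisionModule(
    model_name="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    processor=processor,  # an AutoProcessor loaded from the same checkpoint
)
trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=1)
trainer.fit(module, train_dataloaders=train_loader)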
I tried cloning the input_ids and passing them as the labels, but I am not sure that is correct. I have also heard that I might need to right-shift the labels, though I believe that is already handled inside the model's own forward function. I also tried building the labels with processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False), but that fails with an error saying the input_ids length and the labels length are not equal.
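Concretely, the two attempts inside collate_fn looked roughly like this:

# Attempt 1: clone the input_ids and use them directly as labels
labels = model_inputs["input_ids"].clone()

# Attempt 2: tokenize only the answers (this gives a labels tensor shorter than input_ids)
labels = self.processor.tokenizer(
    text=answers,
    return_tensors="pt",
    padding=True,
    return_token_type_ids=False,
)["input_ids"]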
How do I process the labels then?