I am trying to fine-tune a LLaVA-OneVision model (https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf), but I cannot figure out how to prepare the labels to pass to the model. The collate function, the model's forward function, and the training step are given below.
import torch

def collate_fn(self, batch):
    images = []
    texts = []
    answers = []
    for example in batch:
        question, answer, rgb_image_np = example
        images.append(rgb_image_np)
        answers.append(answer)
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},  # Add the question text
                    {"type": "image"},  # Add the image placeholder
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": answer},  # Add the assistant's answer
                ],
            },
        ]
        text_prompt = self.processor.apply_chat_template(conversation)
        texts.append(text_prompt)
    # Prepare inputs (image + text)
    model_inputs = self.processor(
        images=images,
        text=texts,
        return_tensors="pt",
        padding=True
    ).to(torch.float16)
    # Prepare labels
    labels = ???  # How to prepare the labels?
    # Add labels to the batch dictionary
    return {
        "input_ids": model_inputs["input_ids"],
        "labels": labels
    }
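For context, I build the batches with a standard PyTorch DataLoader; the dataset object and batch size below are placeholders, and my dataset yields (question, answer, rgb_image_np) tuples:

from torch.utils.data import DataLoader

# Placeholder setup: `train_dataset` yields (question, answer, rgb_image_np) tuples,
# and `data_module` is the object whose collate_fn is shown above
train_loader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=data_module.collate_fn,
)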
import pytorch_lightning as pl
from transformers import LlavaOnevisionForConditionalGeneration

class LlavaOnevisionModule(pl.LightningModule):
    def __init__(self, model_name, processor, learning_rate=2e-5):
        super().__init__()
        self.model_name = model_name
        self.learning_rate = learning_rate
        self.processor = processor
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            model_name,
            low_cpu_mem_usage=True,
            # use_flash_attention_2=True,
            # torch_dtype=torch.float16
        )
        self.config = self.model.config
        self.pad_token_id = (self.processor.tokenizer.eos_token_id
                             if self.processor.tokenizer.pad_token_id is None
                             else self.processor.tokenizer.pad_token_id)

    def forward(self, input_ids, labels):
        # Build the input dict the model expects, moving tensors to the correct device
        inputs = {
            'input_ids': input_ids.to(self.device),
            'labels': labels.to(self.device),
        }
        # Pass the inputs through the model (which expects input_ids and labels)
        outputs = self.model(**inputs)
        return outputs

    def training_step(self, batch, batch_idx):
        # Unpack the batch
        input_ids = batch['input_ids']
        labels = batch['labels']
        # Forward pass; the model computes the loss from the labels
        outputs = self(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        # Log training loss
        self.log('train_loss', loss)
        return loss
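For completeness, I then train the module with a standard Lightning Trainer, roughly like this (the trainer arguments and the processor variable are placeholders, continuing from the code above):

module = LlavaOnevisionModule(
    model_name="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    processor=processor,  # an AutoProcessor loaded from the same checkpoint
)
trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=1)
trainer.fit(module, train_dataloaders=train_loader)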
I tried cloning the input_ids and passing them as the labels, but I am not sure that is correct. I have also heard that I might need to right-shift the labels, though I believe that is already handled inside the model's own forward function. I also tried building the labels with processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False), but that fails with an error saying the input_ids length and the labels length are not equal.
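Concretely, the two attempts inside collate_fn looked roughly like this:

# Attempt 1: clone the input_ids and use them directly as labels
labels = model_inputs["input_ids"].clone()

# Attempt 2: tokenize only the answers (this gives a labels tensor shorter than input_ids)
labels = self.processor.tokenizer(
    text=answers,
    return_tensors="pt",
    padding=True,
    return_token_type_ids=False,
)["input_ids"]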
How do I process the labels then?