I would like some help from people who are experienced with LLaVA-NeXT (including NeXT-Interleave and OneVision).
These days I have been fine-tuning LLaVA-NeXT on my custom dataset, but the loss stays at around 18 and does not decrease during training.
script
```bash
LLM_VERSION="mylesgoose/Llama-3.1-Minitron-4B-Width-Base"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=llava_llama_3

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

CKPT_PATH=$LLM_VERSION

deepspeed llava/train/train_mem.py \
    --deepspeed scripts/zero3_new.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./playground/data/vqa_data.json \
    --image_folder ./playground/data \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/Llama-3.1-Minitron-4B-Width-Base/vision/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --output_dir "./checkpoints/Llama-3.1-Minitron-4B-Width-Base" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --attn_implementation sdpa \
    --run_name llavanext-siglip-400m-Meta-Llama-3.1-Minitron-4B-pretrain_blip558k_plain
```
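For context, before launching the run I also sanity-check that the projector checkpoint passed to `--pretrain_mm_mlp_adapter` actually loads. This is just my own quick debugging sketch, assuming the checkpoint is a plain state dict of tensors (the path is the one from the script above):

```python
# Rough sanity check (my own, not part of LLaVA-NeXT): load the pretrained
# projector checkpoint and print its parameter names/shapes, just to confirm
# the file exists and looks like an mlp2x_gelu projector state dict.
import torch

ckpt_path = "./checkpoints/projectors/Llama-3.1-Minitron-4B-Width-Base/vision/mm_projector.bin"
state_dict = torch.load(ckpt_path, map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```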
train.py
```python
def preprocess_llama3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False,
    max_len=2048,
    system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
    # roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
    roles = {"human": "user", "gpt": "assistant"}

    # Add image tokens to tokenizer as a special tokens
    # Use a deepcopy of tokenizer so that we don't modify on the tokenizer
    tokenizer = copy.deepcopy(tokenizer)
    # When there is actually an image, we add the image tokens as a special token
    if has_image:
        tokenizer.add_tokens(["<image>"], special_tokens=True)
    image_token_index = tokenizer.convert_tokens_to_ids("<image>")
    bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>")
    start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
    end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
    eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

    unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"]
    unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]

    # After update, calling tokenizer of llama3 will
    # auto add bos id for the tokens. ヽ(`⌒´)ノ
    def safe_tokenizer_llama3(text):
        input_ids = tokenizer(text).input_ids
        if input_ids[0] == bos_token_id:
            input_ids = input_ids[1:]
        return input_ids

    nl_tokens = tokenizer.convert_tokens_to_ids("\n\n")

    # Apply prompt templates
    input_ids, targets = [], []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != roles["human"]:
            source = source[1:]

        input_id, target = [], []

        # New version, use apply chat template
        # Build system message for each sentence
        input_id += tokenizer.apply_chat_template([{"role": "system", "content": system_message}])
        target += [IGNORE_INDEX] * len(input_id)

        for conv in source:
            # Make sure llava data can load
            try:
                role = conv["role"]
                content = conv["content"]
            except:
                role = conv["from"]
                content = conv["value"]

            role = roles.get(role, role)

            conv = [{"role": role, "content": content}]
            # First is bos token we don't need here
            encode_id = tokenizer.apply_chat_template(conv)[1:]
            input_id += encode_id
            if role in ["user", "system"]:
                target += [IGNORE_INDEX] * len(encode_id)
            else:
                target += encode_id

        assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
        for idx, encode_id in enumerate(input_id):
            if encode_id in unmask_tokens_idx:
                target[idx] = encode_id
            if encode_id == image_token_index:
                input_id[idx] = IMAGE_TOKEN_INDEX
        input_ids.append(input_id)
        targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)

    return dict(
        input_ids=input_ids,  # tensor(bs x seq_len)
        labels=targets,  # tensor(bs x seq_len)
    )
```
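Since I suspect the tokenization, here is a rough sketch of how I inspect what `preprocess_llama3` returns for a single made-up sample (it assumes the same globals as `train.py`, i.e. `IGNORE_INDEX`, `IMAGE_TOKEN_INDEX`, and a loaded `tokenizer`; the conversation itself is hypothetical):

```python
# Debugging sketch (my own): run preprocess_llama3 on one fabricated sample and
# check that only the assistant tokens (plus the unmasked special tokens) are
# supervised, i.e. not set to IGNORE_INDEX in the labels.
sample = [[
    {"from": "human", "value": "<image>\nWhat is in the picture?"},
    {"from": "gpt", "value": "A cat sitting on a chair."},
]]

out = preprocess_llama3(sample, tokenizer, has_image=True)
ids, labels = out["input_ids"][0], out["labels"][0]
print("sequence length:", len(ids))
print("supervised tokens:", int((labels != IGNORE_INDEX).sum()))

# Decode only the supervised positions (skipping the negative IMAGE_TOKEN_INDEX
# placeholder) to see exactly which text the loss is computed on.
supervised = [int(t) for t, l in zip(ids, labels) if l != IGNORE_INDEX and t >= 0]
print(tokenizer.decode(supervised))
```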
conversation.py
```python
conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language.",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Llama-3.1-Minitron-4B-Width-Base",
    # tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer=safe_load_tokenizer("mylesgoose/Llama-3.1-Minitron-4B-Width-Base"),
    # tokenizer=safe_load_tokenizer("meta-llama/Meta-Llama-3-8B-Instruct"),
    stop_token_ids=[128009, 128008, 128001],
    # stop_token_ids=[128009],
)
```
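Related to that, here is a small check I run on the tokenizer itself, since both `conv_llava_llama_3` and `preprocess_llama3` rely on the Llama-3 chat template and special tokens (just a sketch, assuming the repo can be loaded with `AutoTokenizer`):

```python
# Sketch (my own): verify the Minitron tokenizer actually ships a chat template
# and the Llama-3 special tokens that the conversation template and the
# preprocessing code expect, and print one rendered example turn.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mylesgoose/Llama-3.1-Minitron-4B-Width-Base")
print("has chat_template:", tok.chat_template is not None)
for t in ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    print(t, tok.convert_tokens_to_ids(t))
if tok.chat_template:
    print(tok.apply_chat_template([{"role": "user", "content": "hello"}], tokenize=False))
```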
Those are the training script and its associated code snippets; the LLM I use is mylesgoose/Llama-3.1-Minitron-4B-Width-Base.
My server uses 2 NVIDIA RTX 6000 GPUs, 240 GiB RAM, and 20 CPU cores.
What could be causing this training problem? My guess is that I am still missing something in the tokenization and padding. I can share more information, such as the 'tokenizer_config.json' file, if needed.
Following the model owner's guide, I simply edited the 'conv_llava_llama_3' variable so it can handle Llama 3.1, as you can see in the "conversation.py" snippet, because I was told that would be the easiest way to use the model. I was not instructed to change anything in "train.py".