I would like some help from people who are experienced with LLaVA-NeXT (including NeXT-Interleave and OneVision).
These days I have been fine-tuning LLaVA-NeXT on my custom dataset, but the loss stays at around 18 and does not decrease during training.
script
```bash
LLM_VERSION="mylesgoose/Llama-3.1-Minitron-4B-Width-Base"
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION=llava_llama_3

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

CKPT_PATH=$LLM_VERSION

deepspeed llava/train/train_mem.py \
    --deepspeed scripts/zero3_new.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path ./playground/data/vqa_data.json \
    --image_folder ./playground/data \
    --pretrain_mm_mlp_adapter="./checkpoints/projectors/Llama-3.1-Minitron-4B-Width-Base/vision/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --fp16 True \
    --output_dir "./checkpoints/Llama-3.1-Minitron-4B-Width-Base" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32 \
    --attn_implementation sdpa \
    --run_name llavanext-siglip-400m-Meta-Llama-3.1-Minitron-4B-pretrain_blip558k_plain
```
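For context, before launching the run I also sanity-check that the projector checkpoint passed to `--pretrain_mm_mlp_adapter` actually loads. This is just my own quick debugging sketch, assuming the checkpoint is a plain state dict of tensors (the path is the one from the script above):

```python
# Rough sanity check (my own, not part of LLaVA-NeXT): load the pretrained
# projector checkpoint and print its parameter names/shapes, just to confirm
# the file exists and looks like an mlp2x_gelu projector state dict.
import torch

ckpt_path = "./checkpoints/projectors/Llama-3.1-Minitron-4B-Width-Base/vision/mm_projector.bin"
state_dict = torch.load(ckpt_path, map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```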
train.py
```python
def preprocess_llama3(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
    has_image: bool = False,
    max_len=2048,
    system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
    # roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
    roles = {"human": "user", "gpt": "assistant"}

    # Add image tokens to tokenizer as a special tokens
    # Use a deepcopy of tokenizer so that we don't modify on the tokenizer
    tokenizer = copy.deepcopy(tokenizer)
    # When there is actually an image, we add the image tokens as a special token
    if has_image:
        tokenizer.add_tokens(["<image>"], special_tokens=True)
    image_token_index = tokenizer.convert_tokens_to_ids("<image>")
    bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>")
    start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
    end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
    eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

    unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"]
    unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]

    # After update, calling tokenizer of llama3 will
    # auto add bos id for the tokens. ヽ(`⌒´)ノ
    def safe_tokenizer_llama3(text):
        input_ids = tokenizer(text).input_ids
        if input_ids[0] == bos_token_id:
            input_ids = input_ids[1:]
        return input_ids

    nl_tokens = tokenizer.convert_tokens_to_ids("\n\n")

    # Apply prompt templates
    input_ids, targets = [], []
    for i, source in enumerate(sources):
        if roles[source[0]["from"]] != roles["human"]:
            source = source[1:]

        input_id, target = [], []

        # New version, use apply chat template
        # Build system message for each sentence
        input_id += tokenizer.apply_chat_template([{"role": "system", "content": system_message}])
        target += [IGNORE_INDEX] * len(input_id)

        for conv in source:
            # Make sure llava data can load
            try:
                role = conv["role"]
                content = conv["content"]
            except:
                role = conv["from"]
                content = conv["value"]

            role = roles.get(role, role)

            conv = [{"role": role, "content": content}]
            # First is bos token we don't need here
            encode_id = tokenizer.apply_chat_template(conv)[1:]
            input_id += encode_id
            if role in ["user", "system"]:
                target += [IGNORE_INDEX] * len(encode_id)
            else:
                target += encode_id

        assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
        for idx, encode_id in enumerate(input_id):
            if encode_id in unmask_tokens_idx:
                target[idx] = encode_id
            if encode_id == image_token_index:
                input_id[idx] = IMAGE_TOKEN_INDEX
        input_ids.append(input_id)
        targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)

    return dict(
        input_ids=input_ids,  # tensor(bs x seq_len)
        labels=targets,  # tensor(bs x seq_len)
    )
```
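Since I suspect the tokenization, here is a rough sketch of how I inspect what `preprocess_llama3` returns for a single made-up sample (it assumes the same globals as `train.py`, i.e. `IGNORE_INDEX`, `IMAGE_TOKEN_INDEX`, and a loaded `tokenizer`; the conversation itself is hypothetical):

```python
# Debugging sketch (my own): run preprocess_llama3 on one fabricated sample and
# check that only the assistant tokens (plus the unmasked special tokens) are
# supervised, i.e. not set to IGNORE_INDEX in the labels.
sample = [[
    {"from": "human", "value": "<image>\nWhat is in the picture?"},
    {"from": "gpt", "value": "A cat sitting on a chair."},
]]

out = preprocess_llama3(sample, tokenizer, has_image=True)
ids, labels = out["input_ids"][0], out["labels"][0]
print("sequence length:", len(ids))
print("supervised tokens:", int((labels != IGNORE_INDEX).sum()))

# Decode only the supervised positions (skipping the negative IMAGE_TOKEN_INDEX
# placeholder) to see exactly which text the loss is computed on.
supervised = [int(t) for t, l in zip(ids, labels) if l != IGNORE_INDEX and t >= 0]
print(tokenizer.decode(supervised))
```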
conversation.py
```python
conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language.",
    roles=("user", "assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep="<|eot_id|>",
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="mylesgoose/Llama-3.1-Minitron-4B-Width-Base",
    # tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer=safe_load_tokenizer("mylesgoose/Llama-3.1-Minitron-4B-Width-Base"),
    # tokenizer=safe_load_tokenizer("meta-llama/Meta-Llama-3-8B-Instruct"),
    stop_token_ids=[128009, 128008, 128001],
    # stop_token_ids=[128009],
)
```
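Related to that, here is a small check I run on the tokenizer itself, since both `conv_llava_llama_3` and `preprocess_llama3` rely on the Llama-3 chat template and special tokens (just a sketch, assuming the repo can be loaded with `AutoTokenizer`):

```python
# Sketch (my own): verify the Minitron tokenizer actually ships a chat template
# and the Llama-3 special tokens that the conversation template and the
# preprocessing code expect, and print one rendered example turn.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mylesgoose/Llama-3.1-Minitron-4B-Width-Base")
print("has chat_template:", tok.chat_template is not None)
for t in ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    print(t, tok.convert_tokens_to_ids(t))
if tok.chat_template:
    print(tok.apply_chat_template([{"role": "user", "content": "hello"}], tokenize=False))
```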
Those are the training script and its associated code snippets; the LLM I use is mylesgoose/Llama-3.1-Minitron-4B-Width-Base.
My server uses 2 NVIDIA RTX 6000 GPUs, 240 GiB RAM, and 20 CPU cores.
What could be causing this training problem? My guess is that I am still missing something in the tokenization and padding. I can share more information, such as the 'tokenizer_config.json' file, if needed.
Following the model owner's guide, I simply edited the 'conv_llava_llama_3' variable so it can handle Llama 3.1, as you can see in the "conversation.py" snippet, because I was told that would be the easiest way to use the model. I was not instructed to change anything in "train.py".