meysam

Reputation: 83

How to Load a 4-bit Quantized VLM Model from Hugging Face with Transformers?

I’m new to quantization and to working with vision-language models (VLMs). I’m trying to load a 4-bit quantized version of the Ovis1.6-Gemma2 model from Hugging Face using the transformers library. I downloaded the model from this link: https://huggingface.co/ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit.

Here’s the code I’m using to load the model:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Define the quantization configuration
kwargs = {
    "quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        bnb_4bit_compute_dtype="float32",
        bnb_4bit_quant_storage="uint8",
        bnb_4bit_quant_type="fp4",
        bnb_4bit_use_double_quant=False,
        llm_int8_enable_fp32_cpu_offload=False,
        llm_int8_has_fp16_weight=False,
        llm_int8_skip_modules=None,
        llm_int8_threshold=6.0
    )
}

model = AutoModelForCausalLM.from_pretrained(
    "ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit",
    trust_remote_code=True,
    **kwargs
).cuda()

However, I am encountering the following warnings:

warnings.warn(_BETA_TRANSFORMS_WARNING)
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method'].
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.06s/it]
You shouldn't move a model that is dispatched using accelerate hooks.

Additionally, when I try to access the tokenizers:

text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

I get the following error:

AttributeError: 'NoneType' object has no attribute 'get_text_tokenizer'

How can I properly load the 4-bit quantized model without encountering these warnings? Why am I receiving an AttributeError when trying to access the tokenizers? Does this model not support them?

Upvotes: 3

Views: 569

Answers (1)

Anirudh Senani

Reputation: 11

I could not reproduce your issue; the model loads fine for me with essentially the same code after a few changes.

The quantization config is not needed here, since it is already embedded in the model config (check the config.json file in the repository you linked).
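If you want to confirm that, a quick sketch is to pull the repo's config.json with huggingface_hub and print its quantization_config block (this is just for inspection, not part of the fix):

import json
from huggingface_hub import hf_hub_download

# Download the repo's config.json and print the embedded quantization settings
# that transformers picks up automatically on from_pretrained.
config_path = hf_hub_download(
    "ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit",
    "config.json"
)
with open(config_path) as f:
    config = json.load(f)
print(config.get("quantization_config"))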

Just make sure you have the latest Hugging Face transformers library installed, and set the low_cpu_mem_usage parameter to True.

pip install -U transformers

Modified code:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit",
    trust_remote_code=True,
    low_cpu_mem_usage=True
).cuda()

Edit: I could also access the text and visual tokenizers without any issues.
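For reference, a minimal check along these lines works. The get_text_tokenizer / get_visual_tokenizer methods come from the model's remote code, as in your snippet; the sample encode at the end is only an illustration and assumes the text tokenizer behaves like a standard transformers tokenizer:

# Verify both tokenizers are returned and usable after loading the model.
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

assert text_tokenizer is not None and visual_tokenizer is not None

# The text tokenizer can be called like a regular Hugging Face tokenizer.
print(text_tokenizer("Describe this image.", return_tensors="pt").input_ids.shape)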

These are the versions of the dependencies required on colab:

bitsandbytes                       0.44.1
safetensors                        0.4.5
tokenizers                         0.20.1
torch                              2.5.0+cu121
torchaudio                         2.5.0+cu121
torchsummary                       1.5.1
torchvision                        0.20.0+cu121
transformers                       4.46.0
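If you want to match these versions, pinning the key packages should get you close (the torch/torchvision CUDA wheels depend on your platform, so I've left those out):

pip install "transformers==4.46.0" "bitsandbytes==0.44.1" "tokenizers==0.20.1" "safetensors==0.4.5"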

Upvotes: 0
