meysam

Reputation: 83

How to Load a 4-bit Quantized VLM Model from Hugging Face with Transformers?

I’m new to quantization and to working with vision-language models (VLMs). I’m trying to load a 4-bit quantized version of the Ovis1.6-Gemma2 model from Hugging Face using the transformers library. I downloaded the model from this link: https://huggingface.co/ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit.

Here’s the code I’m using to load the model:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Define the quantization configuration
kwargs = {
    "quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        bnb_4bit_compute_dtype="float32",
        bnb_4bit_quant_storage="uint8",
        bnb_4bit_quant_type="fp4",
        bnb_4bit_use_double_quant=False,
        llm_int8_enable_fp32_cpu_offload=False,
        llm_int8_has_fp16_weight=False,
        llm_int8_skip_modules=None,
        llm_int8_threshold=6.0
    )
}

model = AutoModelForCausalLM.from_pretrained(
    "ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit",
    trust_remote_code=True,
    **kwargs
).cuda()

However, I am encountering the following warnings:

warnings.warn(_BETA_TRANSFORMS_WARNING)
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method'].
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.06s/it]
You shouldn't move a model that is dispatched using accelerate hooks.

Additionally, when I try to access the tokenizers:

text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

I get the following error:

AttributeError: 'NoneType' object has no attribute 'get_text_tokenizer'

How can I properly load the 4-bit quantized model without encountering these warnings? Why am I receiving an AttributeError when trying to access the tokenizers? Does this model not support them?

Upvotes: 3

Views: 569

Answers (1)

Anirudh Senani

Reputation: 11

I could not reproduce your issue; the model loads fine for me with essentially the same code after a few changes.

The quantization config is not needed here, since it is already embedded in the model config (check the config.json file in the repository you linked).
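If you want to confirm that, a quick sketch is to pull the repo's config.json with huggingface_hub and print its quantization_config block (this is just for inspection, not part of the fix):

import json
from huggingface_hub import hf_hub_download

# Download the repo's config.json and print the embedded quantization settings
# that transformers picks up automatically on from_pretrained.
config_path = hf_hub_download(
    "ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit",
    "config.json"
)
with open(config_path) as f:
    config = json.load(f)
print(config.get("quantization_config"))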

Just make sure you have the latest Hugging Face transformers library installed, and set the low_cpu_mem_usage parameter to True.

pip install -U transformers

Modified code:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ThetaCursed/Ovis1.6-Gemma2-9B-bnb-4bit",
    trust_remote_code=True,
    low_cpu_mem_usage=True
).cuda()

Edit: I could also access the text and visual tokenizers without any issues.
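For reference, a minimal check along these lines works. The get_text_tokenizer / get_visual_tokenizer methods come from the model's remote code, as in your snippet; the sample encode at the end is only an illustration and assumes the text tokenizer behaves like a standard transformers tokenizer:

# Verify both tokenizers are returned and usable after loading the model.
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

assert text_tokenizer is not None and visual_tokenizer is not None

# The text tokenizer can be called like a regular Hugging Face tokenizer.
print(text_tokenizer("Describe this image.", return_tensors="pt").input_ids.shape)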

These are the versions of the dependencies required on colab:

bitsandbytes                       0.44.1
safetensors                        0.4.5
tokenizers                         0.20.1
torch                              2.5.0+cu121
torchaudio                         2.5.0+cu121
torchsummary                       1.5.1
torchvision                        0.20.0+cu121
transformers                       4.46.0
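If you want to match these versions, pinning the key packages should get you close (the torch/torchvision CUDA wheels depend on your platform, so I've left those out):

pip install "transformers==4.46.0" "bitsandbytes==0.44.1" "tokenizers==0.20.1" "safetensors==0.4.5"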

Upvotes: 0
