Reputation: 3524
We are trying to deploy a quantized Llama 3.1 70B model (from Hugging Face, using bitsandbytes). The quantization step itself works fine: the model's memory footprint is what we expect, and test predictions are correct. The problem is that after saving the quantized model and then loading it, we get:
ValueError: Supplied state dict for layers.0.mlp.down_proj.weight does not contain bitsandbytes__* and possibly other quantized_stats components.
Here is what we do:
import torch
from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
cache_dir = "/home/ec2-user/SageMaker/huggingface_cache"

# 4-bit NF4 quantization with double quantization and bfloat16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    offload_folder="offload",
    offload_state_dict=True,
    cache_dir=cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

# Save the quantized model and tokenizer
pt_save_directory = "test_directory"
tokenizer.save_pretrained(pt_save_directory)
model_4bit.save_pretrained(pt_save_directory)

# Test loading it back -- this is where the ValueError is raised
loaded_model = AutoModel.from_pretrained(
    pt_save_directory,
    quantization_config=quantization_config,
)
Upvotes: 0
Views: 576
Reputation: 320
This is a bug in the _load_pretrained_model() function in transformers/modeling_utils.py when loading sharded weight files: the state_dict is applied to the empty model one shard at a time. That is problematic because a quantized weight and its metadata (*.quant_state.bitsandbytes__nf4) may be stored in different shards. The quick-and-dirty fix is to merge the tensors from all shards into one state_dict. Similar issues have been reported on the Unsloth GitHub repository (issue 638).
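For illustration, here is a minimal sketch of that merge-the-shards workaround. It assumes the model was saved as sharded .safetensors files in test_directory (as in the question); the merged_state_dict name and the use of from_pretrained's state_dict argument are my own choices here, so verify the behavior against your transformers version.

import json
import os

import torch
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

pt_save_directory = "test_directory"

# Collect every tensor from every shard into a single state_dict, so that a
# quantized weight and its *.quant_state.bitsandbytes__nf4 entries are never
# split across separate per-shard loading passes.
index_path = os.path.join(pt_save_directory, "model.safetensors.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

merged_state_dict = {}
for shard_file in sorted(set(weight_map.values())):
    merged_state_dict.update(load_file(os.path.join(pt_save_directory, shard_file)))

# Hand the merged state_dict to from_pretrained in one piece instead of
# letting it load the checkpoint shard by shard.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
loaded_model = AutoModelForCausalLM.from_pretrained(
    pt_save_directory,
    state_dict=merged_state_dict,
    quantization_config=quantization_config,
    device_map="auto",
)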
Upvotes: 1