Reputation: 1371
I'm very new to generative AI. I have 64 GB of RAM and a 20 GB GPU. I took an open-source model from Hugging Face and used Python to simply prompt the out-of-the-box model and display the result. I downloaded the model locally using save_pretrained
and load it from local disk thereafter. It works, but every time I run the Python file it takes more than 10 minutes to display the results.
There is a step, Loading checkpoint shards,
that takes 6-7 minutes every time. Am I doing anything wrong? Why does it have to load something every time even though the model is referenced from local disk?
I tried using local_files_only=True, cache_dir=cache_dir, low_cpu_mem_usage=True, and max_shard_size="200MB",
but none of them solved the time issue.
How can I prompt the saved model directly without such a long delay, so that it is usable interactively? Any help would be highly appreciated.
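For reference, this is roughly the setup described above; the model name and directory are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/some-model"  # placeholder for the actual model used
local_dir = "./local_model"

# One-time download and save to disk
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.save_pretrained(local_dir)
model.save_pretrained(local_dir)

# Every subsequent run: this is where "Loading checkpoint shards" appears
tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(local_dir, local_files_only=True)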
Upvotes: 12
Views: 22308
Reputation: 1
You can try loading the model with device_map="auto", which places the weights on your GPU automatically:
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True).eval()
It may work.
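For context, here is a fuller sketch of that approach, including generation; the path and prompt are placeholders, and device_map="auto" requires the accelerate package to be installed:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./local_model"  # placeholder for your saved directory

tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",       # place weights on available GPU(s) automatically
    trust_remote_code=True,  # only needed if the model ships custom code
).eval()

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))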
Upvotes: -1
Reputation: 39
I had exactly the same problem, and I fixed it by setting safe_serialization=True
when calling the save_pretrained()
method. Hope this works for you. However, I would still like to know what was going on when loading a model in the vanilla .bin format.
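For reference, a minimal sketch of that fix (model name and path are placeholders). safe_serialization=True writes the weights as .safetensors files, which can be memory-mapped at load time and are typically much faster to load than the pickle-based .bin shards:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-org/some-model")  # placeholder

# Save in the safetensors format instead of pickle-based .bin shards
model.save_pretrained("./local_model", safe_serialization=True)

# Subsequent loads pick up the .safetensors files
model = AutoModelForCausalLM.from_pretrained("./local_model", local_files_only=True)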
Upvotes: 3