Reputation: 31
Hi guys, I am running the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('./cache/model')
tokenizer = AutoTokenizer.from_pretrained('./cache/model')
where I had previously cached a Hugging Face model using cache_dir within the from_pretrained() method. However, every time I load the model it has to load the checkpoint shards, which takes 7-10 minutes before each inference:
Loading checkpoint shards: 67%|######6 | 2/3 [06:17<03:08, 188.79s/it]
Why is this taking so long even though I am loading the model locally, where it is already downloaded?
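For context, the download/caching step I mentioned above looked roughly like this (I may be misremembering the exact arguments, and the model id below is only a placeholder, not the model I am actually using):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Download once and cache the files under ./cache/model
model = AutoModelForCausalLM.from_pretrained('some-org/some-model', cache_dir='./cache/model')
tokenizer = AutoTokenizer.from_pretrained('some-org/some-model', cache_dir='./cache/model')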
I am using some powerful GPUs, so the actual inference only takes a few seconds, but the time it takes to load the model into memory is very long.
Is there any way around this? I saw someone on a similar thread say they used safe_serialization with the save_pretrained() method, but my issue is that I am loading a pretrained model rather than fine-tuning and saving my own, so I am unsure how to apply that suggestion (I have sketched below what I think they meant).
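In case it clarifies what I mean, below is roughly what I imagine that suggestion would look like: load the model once, re-save it locally in safetensors format, and then load from the re-saved folder on later runs. The target folder name is just an example, and I do not know whether this actually makes loading the shards faster, which is why I am asking:

from transformers import AutoModelForCausalLM, AutoTokenizer

# One-off step: load the already cached model ...
model = AutoModelForCausalLM.from_pretrained('./cache/model')
tokenizer = AutoTokenizer.from_pretrained('./cache/model')

# ... and re-save it with safetensors serialization (folder name is just an example)
model.save_pretrained('./cache/model-safetensors', safe_serialization=True)
tokenizer.save_pretrained('./cache/model-safetensors')

# Later runs would then load from the re-saved folder instead
model = AutoModelForCausalLM.from_pretrained('./cache/model-safetensors')
tokenizer = AutoTokenizer.from_pretrained('./cache/model-safetensors')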
Any help here would be great.
Upvotes: 2
Views: 3653