Reputation: 843
I'm trying to use Llama 2 chat (via Hugging Face) with 7B parameters in Google Colab (Python 3.10.12). I've already obtained my access token from Meta. I'm simply using the code from Hugging Face that shows how to load the model, along with my access token. Here is my code:
!pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
token = "---Token copied from Hugging Face and pasted here---"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
It starts downloading the model, but when it reaches Loading checkpoint shards: it just stops running, and there is no error.
Upvotes: 3
Views: 1335
Reputation: 2852
The issue is your Colab instance running out of RAM. Based on your comments, you are using the basic Colab instance with 12.7 GB of CPU RAM.
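You can verify how much memory the runtime actually has; a quick check with psutil (which is preinstalled in Colab) looks like this:

import psutil

# Total and currently available RAM of the Colab runtime, in GB.
mem = psutil.virtual_memory()
print(f"Total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")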
For the Llama-2-7B model you'll need roughly 28 GB of RAM just to load the weights in full precision (fp32), or about 14 GB in fp16.
Check this link for details on the required resources: huggingface.co/NousResearch/Llama-2-7b-chat-hf/discussions/3
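As a rough back-of-the-envelope check (weights only, ignoring activations and loading overhead), the memory needed is the parameter count times the bytes per parameter:

# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9

for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.1f} GB")

With fp32 weights that is about 28 GB, well above the 12.7 GB of a basic Colab instance.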
Also, if you only want to do inference (predictions) with the model, I would recommend using its quantized 4-bit or 8-bit versions. Both can be run on modest hardware and don't need a lot of memory.
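Here is a minimal sketch of loading the model in 4-bit with bitsandbytes; it assumes a GPU runtime (e.g. the free Colab T4) and reuses the model name and token from the question:

# !pip install transformers accelerate bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

token = "---Token copied from Hugging Face and pasted here---"

# Quantize the weights to 4 bit; keep compute in fp16 for speed and quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on the GPU
    token=token,
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note that bitsandbytes quantization itself runs on the GPU; for a purely CPU setup, pre-quantized GGUF/GGML builds of the model (run through llama.cpp) are the usual alternative.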
Upvotes: 5