Reputation: 843
I'm trying to use Llama 2 chat (via Hugging Face) with 7B parameters in Google Colab (Python 3.10.12). I've already obtained my access token from Meta. I'm simply using the code from Hugging Face that shows how to load the model, along with my access token. Here is my code:
!pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
token = "---Token copied from Hugging Face and pasted here---"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
It starts downloading the model, but when it reaches Loading checkpoint shards: it just stops running, and there is no error.
Upvotes: 3
Views: 1335
Reputation: 2852
The issue is your Colab instance running out of RAM. Based on your comments, you are using the basic Colab instance with 12.7 GB of CPU RAM.
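You can verify how much memory the runtime actually has; a quick check with psutil (which is preinstalled in Colab) looks like this:

import psutil

# Total and currently available RAM of the Colab runtime, in GB.
mem = psutil.virtual_memory()
print(f"Total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")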
For the Llama-2-7B model you'll need roughly 28 GB of RAM just to load the weights in full precision (fp32), or about 14 GB in fp16.
Check this link for details on the required resources: huggingface.co/NousResearch/Llama-2-7b-chat-hf/discussions/3
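As a rough back-of-the-envelope check (weights only, ignoring activations and loading overhead), the memory needed is the parameter count times the bytes per parameter:

# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9

for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.1f} GB")

With fp32 weights that is about 28 GB, well above the 12.7 GB of a basic Colab instance.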
Also, if you only want to do inference (predictions) with the model, I would recommend using its quantized 4-bit or 8-bit versions. Both can be run on modest hardware and don't need a lot of memory.
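Here is a minimal sketch of loading the model in 4-bit with bitsandbytes; it assumes a GPU runtime (e.g. the free Colab T4) and reuses the model name and token from the question:

# !pip install transformers accelerate bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

token = "---Token copied from Hugging Face and pasted here---"

# Quantize the weights to 4 bit; keep compute in fp16 for speed and quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on the GPU
    token=token,
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note that bitsandbytes quantization itself runs on the GPU; for a purely CPU setup, pre-quantized GGUF/GGML builds of the model (run through llama.cpp) are the usual alternative.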
Upvotes: 5