Reputation: 22356
How to force langchain to use the HF_HOME environment variable to load the model?
The Snowflake/snowflake-arctic-embed-l model files have been downloaded to $HF_HOME/Snowflake/snowflake-arctic-embed-l:
$ echo $HF_HOME
/tmp-data
$ ls /tmp-data/Snowflake/snowflake-arctic-embed-l
1_Pooling README.md config_sentence_transformers.json model.safetensors sentence_bert_config.json tokenizer.json vocab.txt
2_Normalize config.json hoge.tgz modules.json special_tokens_map.json tokenizer_config.json
The Python runtime sees the HF_HOME environment variable:
>>> import os
>>> os.getenv("HF_HOME")
'/tmp-data'
However, it tries to download the model from the internet:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    model_name="Snowflake/snowflake-arctic-embed-l",
    tokens_per_chunk=500,
    chunk_overlap=50,
)
No sentence-transformers model found with name Snowflake/snowflake-arctic-embed-l. Creating a new one with mean pooling.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 704, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 205, in _new_conn
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7d1562b96680>: Failed to resolve 'huggingface.co' ([Errno -2] Name or service not known)
Giving the full path to the local model directory fixes the issue, but I need to utilise HF_HOME:
splitter = SentenceTransformersTokenTextSplitter(
    model_name="/tmp-data/Snowflake/snowflake-arctic-embed-l",
    tokens_per_chunk=500,
    chunk_overlap=50,
)
You try to use a model that was created with version 3.4.1, however, your version is 3.0.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
Upvotes: 0
Views: 15
Reputation: 496
There's no built-in support for HF_HOME in langchain (or its dependencies, https://github.com/UKPLab/sentence-transformers in this case). You'll just have to prefix your model names with the HF_HOME variable on your end. The alternative is monkey-patching the library, but I wouldn't suggest that for something like this.
Something like this perhaps:
import os

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

HF_HOME = os.environ.get("HF_HOME")

def prefix_hf_home(model_name):
    # Turn a Hub-style model name into an absolute path under HF_HOME.
    return os.path.join(HF_HOME, model_name)

splitter = SentenceTransformersTokenTextSplitter(
    model_name=prefix_hf_home("Snowflake/snowflake-arctic-embed-l"),
    tokens_per_chunk=500,
    chunk_overlap=50,
)
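If HF_HOME might be unset, os.path.join would receive None and raise a TypeError. A small variant of the helper (a sketch, not part of any library; ~/.cache/huggingface is the documented Hugging Face default when HF_HOME is not set) guards against that:

import os

def prefix_hf_home(model_name):
    # Fall back to the documented default cache location when HF_HOME is unset.
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    return os.path.join(hf_home, model_name)

# With HF_HOME=/tmp-data (as in the question), this returns the same
# local path that worked when passed directly as model_name.
path = prefix_hf_home("Snowflake/snowflake-arctic-embed-l")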
Upvotes: 0