mon

Reputation: 22356

How to force langchain to use the HF_HOME environment variable to load the model from local disk instead of the Internet

How can I force langchain to use the HF_HOME environment variable to load the model?

The Snowflake/snowflake-arctic-embed-l model files have been downloaded to $HF_HOME/Snowflake/snowflake-arctic-embed-l.

$ echo $HF_HOME
/tmp-data

$ ls /tmp-data/Snowflake/snowflake-arctic-embed-l
1_Pooling    README.md    config_sentence_transformers.json  model.safetensors  sentence_bert_config.json  tokenizer.json         vocab.txt
2_Normalize  config.json  hoge.tgz                           modules.json       special_tokens_map.json    tokenizer_config.json

The Python runtime sees the HF_HOME environment variable.

>>> import os
>>> os.getenv("HF_HOME")
'/tmp-data'

However, langchain still tries to download the model from the internet.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter=SentenceTransformersTokenTextSplitter(
  model_name="Snowflake/snowflake-arctic-embed-l", 
  tokens_per_chunk=500, 
  chunk_overlap=50
)
No sentence-transformers model found with name Snowflake/snowflake-arctic-embed-l. Creating a new one with mean pooling.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 488, in _make_request
    raise new_e
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 464, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 704, in connect
    self.sock = sock = self._new_conn()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7d1562b96680>: Failed to resolve 'huggingface.co' ([Errno -2] Name or service not known)

Passing the full path to the local model directory fixes the issue, but I need to rely on HF_HOME.

splitter=SentenceTransformersTokenTextSplitter(model_name="/tmp-data/Snowflake/snowflake-arctic-embed-l", tokens_per_chunk=500, chunk_overlap=50)
You try to use a model that was created with version 3.4.1, however, your version is 3.0.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.

Upvotes: 0

Views: 15

Answers (1)

Wiggy A.

Reputation: 496

There's no built-in support for HF_HOME in langchain (or its dependencies, https://github.com/UKPLab/sentence-transformers in this case). You'll have to prefix your model names with the value of HF_HOME on your end. You could monkey patch the library instead, but I wouldn't suggest that for something like this.

Something like this perhaps:

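import os

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Fail early with a KeyError if HF_HOME is not set, rather than passing None to os.path.join
HF_HOME = os.environ["HF_HOME"]


def prefix_hf_home(model_name):
    # Build the local model directory, e.g. $HF_HOME/Snowflake/snowflake-arctic-embed-l
    return os.path.join(HF_HOME, model_name)


splitter = SentenceTransformersTokenTextSplitter(
    model_name=prefix_hf_home("Snowflake/snowflake-arctic-embed-l"),
    tokens_per_chunk=500,
    chunk_overlap=50,
)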
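
If the same code also has to run on machines where HF_HOME is unset or the model directory is missing, the helper above could fall back to the plain model name. This is only a sketch: resolve_model_name is a hypothetical name, and the only check it makes is that the directory exists, not that it holds a valid model.

```python
import os
from pathlib import Path


def resolve_model_name(model_name, env_var="HF_HOME"):
    """Return $HF_HOME/<model_name> if that directory exists, else model_name.

    resolve_model_name is a hypothetical helper, not part of langchain or
    sentence-transformers.
    """
    root = os.environ.get(env_var)
    if not root:
        # Variable unset or empty: leave the name untouched
        return model_name
    candidate = Path(root) / model_name
    if candidate.is_dir():
        # Local copy found: hand the library a filesystem path
        return str(candidate)
    # No local copy: fall back to the original name
    return model_name
```

The result can then be passed as model_name to SentenceTransformersTokenTextSplitter in place of prefix_hf_home(...). When the fallback branch is taken, the library will try to reach the internet again, as in the question.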

Upvotes: 0
