Reputation: 22356
How to force langchain to use the HF_HOME environment variable to load the model?
The Snowflake/snowflake-arctic-embed-l model files have been downloaded to $HF_HOME/Snowflake/snowflake-arctic-embed-l:
$ echo $HF_HOME
/tmp-data
$ ls /tmp-data/Snowflake/snowflake-arctic-embed-l
1_Pooling README.md config_sentence_transformers.json model.safetensors sentence_bert_config.json tokenizer.json vocab.txt
2_Normalize config.json hoge.tgz modules.json special_tokens_map.json tokenizer_config.json
The Python runtime sees the HF_HOME environment variable:
>>> import os
>>> os.getenv("HF_HOME")
'/tmp-data'
However, it tries to download the model from the internet:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    model_name="Snowflake/snowflake-arctic-embed-l",
    tokens_per_chunk=500,
    chunk_overlap=50,
)
No sentence-transformers model found with name Snowflake/snowflake-arctic-embed-l. Creating a new one with mean pooling.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 704, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 205, in _new_conn
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7d1562b96680>: Failed to resolve 'huggingface.co' ([Errno -2] Name or service not known)
Giving the full path to the local model directory fixes the issue, but I need to utilise HF_HOME:
splitter = SentenceTransformersTokenTextSplitter(
    model_name="/tmp-data/Snowflake/snowflake-arctic-embed-l",
    tokens_per_chunk=500,
    chunk_overlap=50,
)
You try to use a model that was created with version 3.4.1, however, your version is 3.0.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
Upvotes: 0
Views: 15
Reputation: 496
There's no built-in support for HF_HOME in langchain (or its dependencies, https://github.com/UKPLab/sentence-transformers in this case). You'll just have to prefix your model names with the HF_HOME variable on your end. The alternative is monkey-patching the library, but I wouldn't suggest that for something like this.
Something like this perhaps:
import os

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

HF_HOME = os.environ.get("HF_HOME")

def prefix_hf_home(model_name):
    # Turn a Hub-style model name into an absolute path under HF_HOME.
    return os.path.join(HF_HOME, model_name)

splitter = SentenceTransformersTokenTextSplitter(
    model_name=prefix_hf_home("Snowflake/snowflake-arctic-embed-l"),
    tokens_per_chunk=500,
    chunk_overlap=50,
)
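If HF_HOME might be unset, os.path.join would receive None and raise a TypeError. A small variant of the helper (a sketch, not part of any library; ~/.cache/huggingface is the documented Hugging Face default when HF_HOME is not set) guards against that:

import os

def prefix_hf_home(model_name):
    # Fall back to the documented default cache location when HF_HOME is unset.
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    return os.path.join(hf_home, model_name)

# With HF_HOME=/tmp-data (as in the question), this returns the same
# local path that worked when passed directly as model_name.
path = prefix_hf_home("Snowflake/snowflake-arctic-embed-l")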
Upvotes: 0