Reputation: 453
I'm using the LlamaIndex library in my Python project to handle some data processing tasks. According to the documentation (Link), I can control where additional data is downloaded by setting the LLAMA_INDEX_CACHE_DIR environment variable. However, despite setting this environment variable, the LlamaIndex library seems to ignore it and continues to store data in a different location.
Here's how I'm setting the environment variable in my Python script:
import os
os.environ["LLAMA_INDEX_CACHE_DIR"] = "/path/to/my/cache/directory"
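As a sanity check (the path below is a placeholder), I also confirmed the variable is actually visible to the process, and set it before any library imports:

```python
import os

# Placeholder path; set the variable BEFORE importing llama_index,
# since some libraries read environment variables once at import time.
os.environ["LLAMA_INDEX_CACHE_DIR"] = "/path/to/my/cache/directory"

# Confirm the variable is visible to this process.
print(os.environ.get("LLAMA_INDEX_CACHE_DIR"))  # → /path/to/my/cache/directory
```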
When creating the index storage (see code below), nltk_data gets downloaded to /Users/user/nltk_data instead of the path I set in the environment variable.
# Import paths assume the llama_index >= 0.10 package layout; adjust for older versions.
from pathlib import Path
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.readers.file import UnstructuredReader

loader = UnstructuredReader()
doc = loader.load_data(file=Path(file), split_documents=False)
storage_context = StorageContext.from_defaults()
cur_index = VectorStoreIndex.from_documents(doc, storage_context=storage_context)
storage_context.persist(persist_dir="./storage/name")
I've checked for typos, ensured correct permissions on the cache directory, and set the environment variable before importing the LlamaIndex library, but the issue persists.
Could anyone suggest why LlamaIndex might not be respecting the LLAMA_INDEX_CACHE_DIR environment variable, and how I can troubleshoot or resolve this issue?
Any insights or suggestions would be greatly appreciated. Thank you!
Upvotes: 0
Views: 454
Reputation: 1
I just had the same issue. What worked for me was setting TIKTOKEN_CACHE_DIR instead of LLAMA_INDEX_CACHE_DIR.

In my case, llama_index was loading another library, tiktoken:
import tiktoken

enc = tiktoken.get_encoding("gpt2")
And then that checks these two environment variables instead of LLAMA_INDEX_CACHE_DIR:
if "TIKTOKEN_CACHE_DIR" in os.environ:
    cache_dir = os.environ["TIKTOKEN_CACHE_DIR"]
elif "DATA_GYM_CACHE_DIR" in os.environ:
    cache_dir = os.environ["DATA_GYM_CACHE_DIR"]
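To make the precedence explicit, here's a minimal standalone sketch of that lookup (resolve_cache_dir is a hypothetical helper for illustration, not part of tiktoken's API):

```python
import os

def resolve_cache_dir(environ=None):
    # Mirrors tiktoken's lookup order: TIKTOKEN_CACHE_DIR wins over
    # DATA_GYM_CACHE_DIR; LLAMA_INDEX_CACHE_DIR is never consulted.
    environ = os.environ if environ is None else environ
    if "TIKTOKEN_CACHE_DIR" in environ:
        return environ["TIKTOKEN_CACHE_DIR"]
    if "DATA_GYM_CACHE_DIR" in environ:
        return environ["DATA_GYM_CACHE_DIR"]
    return None

print(resolve_cache_dir({"TIKTOKEN_CACHE_DIR": "/tmp/tiktoken",
                         "DATA_GYM_CACHE_DIR": "/tmp/gym"}))  # → /tmp/tiktoken
```

So setting TIKTOKEN_CACHE_DIR (before tiktoken is first imported) redirects that particular download.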
Upvotes: 0