Reputation: 261
I have written LangChain code that uses Chroma DB as a vector store for data scraped from a website URL. It currently fetches the data from the URL, stores it in the project folder, and uses that data to respond to a user prompt. I figured out how to make the data persist after the run, but I can't figure out how to load it back for future prompts. The goal is that when a user input is received, the program uses an OpenAI LLM to generate a response based on the existing database files, rather than having to create/write those files on every run. How can this be done?
I tried this as this would likely be the ideal solution:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=vectordb)
But the from_chain_type() function doesn't take a vectorstore as an argument, so this doesn't work.
Upvotes: 26
Views: 66232
Reputation: 1
Since the provided solution didn't work for me (it returned an empty list), and many others online seem to have faced the same issue, I'd like to offer an alternative solution.
First, you’ll need to install chromadb:
pip install chromadb
Or if you're using a notebook, such as a Colab notebook:
!pip install chromadb
Next, load your vector database as follows:
import chromadb
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

client = chromadb.PersistentClient(path='PATH_TO_YOUR_STORED_VECTOR_STORAGE')

embedding_fn = OpenAIEmbeddings(
    model='text-embedding-3-small',  # or another of OpenAI's embedding models
    api_key=API_KEY,
    chunk_size=INT_CHUNK_SIZE,
)  # or any other embedding function

vector_storage = Chroma(
    client=client,
    collection_name="NAME_OF_THE_COLLECTION_YOU_WANT_TO_LOAD",
    embedding_function=embedding_fn,
)
To view your stored collections, you can use:
client.list_collections()
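Once loaded, the store can be queried like any other LangChain vector store; for example (a sketch, assuming the collection was embedded with the same model):
retriever = vector_storage.as_retriever(search_kwargs={"k": 4})  # top-4 matches
docs = retriever.invoke("your question here")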
I hope this solution helps resolve the issue for you as well.
Upvotes: 0
Reputation: 1
This solution works; the main difference is that you also need to specify the collection name.
Saving the database:
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=embd,
    persist_directory="chroma_langchain_db",
)
If you use the langchain_chroma library, you do not need to call vectorstore.persist(); otherwise, add that call after the code above. Code for loading the database:
vectorstore = Chroma(
    collection_name="rag-chroma",
    embedding_function=embd,
    persist_directory="chroma_langchain_db",
)
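To verify the reload picked up the collection, a quick check (a sketch; the query string is just an example):
results = vectorstore.similarity_search("test query", k=2)
print([doc.page_content[:80] for doc in results])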
Upvotes: 0
Reputation: 21
You might want to specify a collection name when creating the vector store. If you have a persist directory, you should be able to retrieve the vector store and its documents.
PERSIST_DIRECTORY = '/path/to/persist/directory'

vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory=PERSIST_DIRECTORY,
)
# Add documents here (see the sketch below)

# To load the vector store, use exactly the same expression
vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory=PERSIST_DIRECTORY,
)

# Check that the documents are there
vector_store.get()['documents']
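For the "Add documents" step, something like the following should work (a sketch; the Document contents are placeholders):
from langchain_core.documents import Document

docs = [
    Document(page_content="some text to index", metadata={"source": "example"}),
]
vector_store.add_documents(documents=docs)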
Upvotes: 2
Reputation: 49182
RetrievalQA is itself a chain; this is how we import it:
from langchain.chains import RetrievalQA
Every chain has two important components: a PromptTemplate and an llm. RetrievalQA needs to get documents and stuff those documents into its own PromptTemplate. That is what this argument is for:
chain_type="stuff",
RetrievalQA has another keyword argument, retriever. This is the communication channel between the RetrievalQA chain and the various vector stores: RetrievalQA retrieves documents from the vector store through the retriever, and the vector store does the similarity search and returns the documents to RetrievalQA. You created
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
Now have RetrievalQA communicate with this vector store through a retriever:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # this will run the similarity search in vectordb
    retriever=vectordb.as_retriever(),
)
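With the chain built, a query runs like this (a sketch; the question string is just an example):
result = qa.run("What does the stored site say about pricing?")
print(result)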
Upvotes: 0
Reputation: 1
def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]
Instead of doing this, you can create a .env file (a secrets file) and put your OpenAI key in it, like this:
OPENAI_API_KEY = "<your_key>"
Then load it in your main file, inside your main function, like this:
from dotenv import load_dotenv

load_dotenv()
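Putting it together, a minimal sketch (assuming the .env file sits in the working directory):
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment
api_key = os.getenv("OPENAI_API_KEY")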
Upvotes: 0
Reputation: 121
All the answers I have seen are missing one crucial step: calling persist() on the DB. As a complete solution, you need to perform the following steps.
To create the DB the first time and persist it, use the lines below:
vectordb = Chroma.from_documents(data, embedding=embeddings, persist_directory = persist_directory)
vectordb.persist()
The db can then be loaded using the below line.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
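A quick way to confirm the reload worked (a sketch):
print(len(vectordb.get()["ids"]))  # should equal the number of stored chunks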
Upvotes: 12
Reputation: 2460
Chroma provides get_collection; see https://docs.trychroma.com/reference/Client#get_collection
Here's an example of my code to query an existing vector store:
def get(embedding_function):
    db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    print(db.get().keys())
    print(len(db.get()["ids"]))
The output of the code, with 7580 chunks as an example:
Using embedded DuckDB with persistence: data will be stored in: ./chroma_db
dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])
7580
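For reference, the native-client route via get_collection looks roughly like this (a sketch, assuming a recent chromadb; "langchain" is LangChain's default collection name):
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("langchain")
print(collection.count())  # number of stored chunks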
Upvotes: 0
Reputation: 988
I just found that the following works:
def fetch_embeddings(collection_name):
    collection = chromadb_client.get_collection(
        name=collection_name, embedding_function=langchain_embedding_function
    )
    embeddings = collection.get(include=["embeddings"])
    print(collection.get(include=["embeddings", "documents", "metadatas"]))
    return embeddings
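For this to run, chromadb_client and the embedding function must already exist; a minimal setup sketch (the path and collection name are placeholders):
import chromadb

chromadb_client = chromadb.PersistentClient(path="./chroma_db")
embeddings = fetch_embeddings("my_collection")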
reference: https://docs.trychroma.com/usage-guide
Upvotes: 0
Reputation: 171
I have tried to use the Chroma vector store loader as well, but my code won't load the DB from the disk. Here is what I did:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFDirectoryLoader
import os
import json
def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]
# Setup
api_key = load_api_key()
os.environ["OPENAI_API_KEY"] = api_key
# load the document and split it into chunks
loader = PyPDFDirectoryLoader("LINK TO FOLDER WITH PDF")
documents = loader.load()
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# load docs into Chroma DB
db = Chroma.from_documents(docs, embedding_function)
# query the DB
query = "MY QUERY"
docs = db.similarity_search(query)
# print results
print(docs[0].page_content)
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
So far no problems! Then when I load the DB with this code:
# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
db3.get()
docs = db3.similarity_search(query)
print(docs[0].page_content)
The db3.get() call already shows that there is no data in db3. It returns:
{'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}
Any ideas why this could be?
Upvotes: 3
Reputation: 373
You need to define the retriever and pass that to the chain. That way the chain will use your previously persisted DB for queries.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
retriever = vectordb.as_retriever()
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
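Querying the persisted data then looks like this (a sketch; the question string is an example):
response = qa.invoke({"query": "What is this site about?"})
print(response["result"])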
Upvotes: 18