max choate

Reputation: 261

LangChain Chroma - load data from Vector Database

I have written LangChain code that uses Chroma DB as a vector store for data from a website URL. It currently fetches the data from the URL, stores it in the project folder, and then uses that data to respond to a user prompt. I figured out how to make that data persist after the run, but I can't figure out how to load it for future prompts. The goal is that when user input is received, the program uses the OpenAI LLM to generate a response based on the existing database files, rather than creating those files again on each run. How can this be done?

What should I do?

I tried the following, since it seemed like the ideal solution:

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=vectordb)

But from_chain_type() doesn't accept a vectorstore argument, so this doesn't work.

Upvotes: 26

Views: 66232

Answers (10)

Danial Baledi

Reputation: 1

The provided solution didn't work for me (it returned an empty list), and many others online seem to have hit the same issue, so I'd like to offer an alternative solution.

First, you’ll need to install chromadb:

pip install chromadb

Or if you're using a notebook, such as a Colab notebook:

!pip install chromadb

Next, load your vector database as follows:

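import chromadb
from langchain_chroma import Chroma
# Assumed import: the original snippet used OpenAIEmbeddings without importing it
from langchain_openai import OpenAIEmbeddings

# Open the on-disk Chroma database
client = chromadb.PersistentClient(path='PATH_TO_YOUR_STORED_VECTOR_STORAGE')
embedding_fn = OpenAIEmbeddings(
    model='text-embedding-3-small',  # or another OpenAI embedding model
    api_key=API_KEY,
    chunk_size=INT_CHUNK_SIZE,
)  # or any other embedding function

# Attach LangChain's Chroma wrapper to an existing collection
vector_storage = Chroma(
    client=client,
    collection_name="NAME_OF_THE_COLLECTION_YOU_WANT_TO_LOAD",
    embedding_function=embedding_fn
)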

To view your stored collections, you can use:

client.list_collections()

I hope this solution helps resolve the issue for you as well.

Upvotes: 0

Shaashwat Agrawal

Reputation: 1

This solution works; the main difference is that you also need to specify the collection name.

Saving the database:

vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=embd,
    persist_directory="chroma_langchain_db",
)

If you use the langchain_chroma library, you do not need to call vectorstore.persist(); otherwise, call it after the code above. Code for loading the database:

vectorstore = Chroma(
    collection_name="rag-chroma",
    embedding_function=embd,
    persist_directory="chroma_langchain_db",
)
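
The save and load snippets above can be combined into a "load if present, otherwise build" pattern. The sketch below is only an illustration: load_or_build is a hypothetical helper (not part of LangChain or Chroma), and it assumes that a non-empty persist directory means a database was already written there.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_directory: str,
                  load: Callable[[], T],
                  build: Callable[[], T]) -> T:
    """Call load() if the persist directory already holds files, else build().

    Hypothetical helper: "directory is non-empty" is used as a rough
    stand-in for "a database was persisted here earlier".
    """
    path = Path(persist_directory)
    has_data = path.is_dir() and any(path.iterdir())
    return load() if has_data else build()


# Possible usage with the snippets from this answer (not executed here):
# vectorstore = load_or_build(
#     "chroma_langchain_db",
#     load=lambda: Chroma(collection_name="rag-chroma",
#                         embedding_function=embd,
#                         persist_directory="chroma_langchain_db"),
#     build=lambda: Chroma.from_documents(documents=doc_splits,
#                                         collection_name="rag-chroma",
#                                         embedding=embd,
#                                         persist_directory="chroma_langchain_db"),
# )
```

Passing the two code paths as callables keeps the expensive embedding step from running when the data is already on disk.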

Upvotes: 0

Emile

Reputation: 21

You might want to specify a collection name when creating the vector store. If you have a persist directory, then you should be able to retrieve the vector stores and the documents.

PERSIST_DIRECTORY = '/path/to/persist/directory'

vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory=PERSIST_DIRECTORY,
)

# Add documents


# To load the vector store, use exactly the same expression

vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory=PERSIST_DIRECTORY,
)

# Check that documents are there 

vector_store.get()['documents']
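
A quick way to turn that check into something you can assert on is to summarize the dictionary that get() returns. This is a minimal sketch: summarize_get_result is a hypothetical helper, and it assumes the result has 'ids' and 'documents' keys, as in the outputs shown elsewhere in this thread.

```python
from typing import Any, Dict


def summarize_get_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the dict returned by vector_store.get().

    Assumes 'ids' and 'documents' keys; missing or None values are
    treated as empty lists.
    """
    ids = result.get("ids") or []
    documents = result.get("documents") or []
    return {
        "count": len(ids),
        "is_empty": len(ids) == 0,
        # First few documents, truncated, for a quick visual check
        "preview": [(doc or "")[:80] for doc in documents[:3]],
    }


# Possible usage (not executed here):
# print(summarize_get_result(vector_store.get()))
```

If is_empty comes back True right after loading, the store was most likely opened with a different directory or collection name than the one it was saved with.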

Upvotes: 2

Yilmaz

Reputation: 49182

RetrievalQA is itself a chain. This is how we import it:

from langchain.chains import RetrievalQA

Every chain has two important components: a PromptTemplate and an LLM. RetrievalQA needs to fetch documents and stuff them into its own PromptTemplate. That is what this argument is for:

chain_type="stuff",
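
To make the idea of "stuffing" concrete, here is a rough sketch in plain Python. It is not LangChain's implementation: stuff_prompt is a hypothetical function and the prompt wording is invented for illustration. It only shows the general shape of placing all retrieved document texts into a single prompt.

```python
from typing import List


def stuff_prompt(question: str, documents: List[str]) -> str:
    """Build one prompt that contains every retrieved document.

    Illustrative only: the template text here is made up and is not
    the template LangChain uses.
    """
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = stuff_prompt(
    "What is Chroma?",
    ["Chroma is a vector database.", "It can persist data to disk."],
)
print(prompt)
```

Because every document goes into one prompt, this approach only works while the combined text fits in the model's context window.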

RetrievalQA has another keyword argument, retriever, which is the interface between the RetrievalQA chain and different vector stores. RetrievalQA retrieves documents from a vector store through the retriever; the vector store does the similarity search and returns the documents to RetrievalQA. You created:

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

Now have RetrievalQA communicate with this vector store through a retriever:

qa = RetrievalQA.from_chain_type(llm=llm, 
                                 chain_type="stuff",
                                 # this will make similarity search in vectordb
                                 retriever=vectordb.as_retriever())

Upvotes: 0

Saif Pasha

Reputation: 1

def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]

Instead of doing this, you can create a .env file (a secrets file) and put your OpenAI key in it, like this:

OPENAI_API_KEY = "<your_key>"

Then import load_dotenv in your main file and call it in your main function, like this:

from dotenv import load_dotenv

USAGE:

load_dotenv()
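
After load_dotenv() runs, the key is available through the process environment. The sketch below shows one way to read it and fail early if it is missing. get_openai_api_key is a hypothetical helper, and the import is wrapped in try/except so the function still works from exported environment variables when python-dotenv is not installed.

```python
import os


def get_openai_api_key() -> str:
    """Return OPENAI_API_KEY from the environment, loading .env if possible."""
    try:
        # Optional dependency (python-dotenv); existing environment
        # variables are not overwritten by default.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # fall back to variables already exported in the shell

    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or export it"
        )
    return key
```

Raising immediately gives a clearer error than letting the OpenAI client fail later with an authentication message.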

Upvotes: 0

Gautam Chauhan

Reputation: 121

All the answers I have seen are missing one crucial step: calling persist() on the DB. As a complete solution, you need to perform the following steps.

To create the DB the first time and persist it, use the lines below.

vectordb = Chroma.from_documents(data, embedding=embeddings, persist_directory = persist_directory)
vectordb.persist()

The DB can then be loaded using the line below.

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
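
Another answer in this thread notes that persist() is not needed with the langchain_chroma library. If your code has to run against either setup, a defensive sketch is to call persist() only when the object has it. persist_if_supported is a hypothetical helper, and it assumes that stores without a persist() method write to disk on their own.

```python
def persist_if_supported(vectordb) -> bool:
    """Call vectordb.persist() if that method exists.

    Returns True if persist() was called, False otherwise. Assumes a
    store without persist() handles writing to disk itself.
    """
    persist = getattr(vectordb, "persist", None)
    if callable(persist):
        persist()
        return True
    return False


# Possible usage (not executed here):
# vectordb = Chroma.from_documents(data, embedding=embeddings,
#                                  persist_directory=persist_directory)
# persist_if_supported(vectordb)
```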

Upvotes: 12

j3ffyang

Reputation: 2460

Chroma provides get_collection at

https://docs.trychroma.com/reference/Client#get_collection

Here's an example of my code for querying an existing vector store:

def get(embedding_function):
    db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    print(db.get().keys())
    print(len(db.get()["ids"]))

Example output, with 7580 chunks:

Using embedded DuckDB with persistence: data will be stored in: ./chroma_db
dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])
7580

Upvotes: 0

s00103898-276165-15433

Reputation: 988

I just found that the following works:

def fetch_embeddings(collection_name):
    collection = chromadb_client.get_collection(
        name=collection_name, embedding_function=langchain_embedding_function
    )
    embeddings = collection.get(include=["embeddings"])

    print(collection.get(include=["embeddings", "documents", "metadatas"]))

    return embeddings

reference: https://docs.trychroma.com/usage-guide

Upvotes: 0

Heka

Reputation: 171

I have tried to use the Chroma vector store loader as well, but my code won't load the DB from the disk. Here is what I did:

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFDirectoryLoader
import os
import json

def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]

# Setup
api_key = load_api_key()
os.environ["OPENAI_API_KEY"] = api_key

# load the document and split it into chunks
loader = PyPDFDirectoryLoader("LINK TO FOLDER WITH PDF")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load docs into Chroma DB
db = Chroma.from_documents(docs, embedding_function)

# query the DB
query = "MY QUERY"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")

So far no problems! Then when I load the DB with this code:

# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
db3.get() 
docs = db3.similarity_search(query)
print(docs[0].page_content)

The db3.get() already shows that there is no data in db3. It returns:

{'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}

Any ideas why this could be?
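
One thing worth checking in a case like this is whether anything was actually written to the directory, and whether the relative path "./chroma_db" resolves to the same place in both runs (it is resolved against the current working directory). The following is a small diagnostic sketch using only the standard library; describe_persist_dir is a hypothetical helper and says nothing about Chroma's internal file layout.

```python
from pathlib import Path
from typing import Any, Dict


def describe_persist_dir(persist_directory: str) -> Dict[str, Any]:
    """Report where a persist directory resolves to and what it contains."""
    path = Path(persist_directory).resolve()
    exists = path.is_dir()
    # Top-level entries only; an empty list suggests nothing was saved
    files = sorted(p.name for p in path.iterdir()) if exists else []
    return {"resolved_path": str(path), "exists": exists, "files": files}


print(describe_persist_dir("./chroma_db"))
```

If the directory is missing or empty after the save step, the problem is on the writing side rather than in the loading code.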

Upvotes: 3

Andrew

Reputation: 373

You need to define the retriever and pass it to the chain. The chain will then use your previously persisted DB for queries.

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

Upvotes: 18
