user791793

Reputation: 701

Get all documents from ChromaDB using Python and LangChain

I'm using LangChain to process a large number of documents that are stored in a MongoDB database.

I can load all the documents into the Chroma vector store using LangChain without any problems. Nothing fancy is being done here. This is my code:


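from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# docs is the list of documents loaded earlier
embeddings = OpenAIEmbeddings()

# Embed the documents and persist the collection to the local 'db' directory
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()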

Now, after storing the data, I want to get a list of all the documents and embeddings together with their IDs.

This is so I can store them back in MongoDB.

I also want to run them through BERTopic to get the topic categories.

Question 1 is: how do I get all the documents I've just stored in the Chroma database? I want the documents and all the metadata.

Many thanks for your help!

Upvotes: 17

Views: 56450

Answers (4)

Nirmal Hasantha

Reputation: 21

Try this. This is what I usually do when working with the chromadb library directly.

import chromadb
from chromadb.utils import embedding_functions

# Create the open-source embedding function
embedding_function1 = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Open the persisted database and fetch (or create) the collection
client = chromadb.PersistentClient(path="./chroma_db1")
col = client.get_or_create_collection(name="test1", embedding_function=embedding_function1)

# Fetch documents and metadata; remove limit to return every entry
all_data = col.get(
    include=["documents", "metadatas"],
    limit=5
)

print(all_data)
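
If the collection is large, it may be easier to read it in pages rather than in one call, since get() also accepts an offset alongside limit. The following is only a rough sketch of that idea: iter_pages and fake_get are hypothetical names, and fake_get is a stand-in with made-up data so the example runs without a database.

```python
from typing import Any, Callable, Dict, Iterator


def iter_pages(get_page: Callable[..., Dict[str, Any]],
               page_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield successive result dicts until a page comes back with no ids."""
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset,
                        include=["documents", "metadatas"])
        if not page["ids"]:
            break
        yield page
        offset += len(page["ids"])


# Stand-in for col.get so the sketch runs on its own (hypothetical data).
_FAKE_IDS = [f"id{i}" for i in range(7)]


def fake_get(limit, offset, include):
    ids = _FAKE_IDS[offset:offset + limit]
    return {
        "ids": ids,
        "documents": [f"text for {i}" for i in ids],
        "metadatas": [{"source": "example.csv"} for _ in ids],
    }


pages = list(iter_pages(fake_get, page_size=3))
page_sizes = [len(p["ids"]) for p in pages]
print(page_sizes)
```

With a real collection you would pass col.get in place of fake_get; the page size of 100 is an arbitrary default.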

Upvotes: 0

texasdave

Reputation: 756

This worked for me. I just needed a list of the file names stored under the source key in the Chroma DB; I didn't want all the other metadata, just the source files.

## get list of all file URLs in vector db

vectordb = Chroma.from_documents(texts, embeddings, persist_directory="db2")
db = vectordb

# Fetch everything once instead of calling get() on every iteration
data = db.get()
print(data.keys())
print(len(data["ids"]))

# Print the list of source files
for metadata in data["metadatas"]:
    print(metadata["source"])

outputs:

dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])
140

csv-files/small-csv.csv
csv-files/dataset-04-17-2024.csv
ppt-content/presentation.pptx
csv-files/events-small-csv.csv
csv-files/small-csv.csv
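
The same file can show up more than once in that output (small-csv.csv does), presumably because every stored chunk carries its own metadata entry. If you only want each file once, something like the sketch below might help. unique_sources and sample_metadatas are hypothetical names, and the sample list only imitates the shape of db.get()["metadatas"].

```python
def unique_sources(metadatas):
    """Return distinct 'source' values in first-seen order.

    Entries that are None or have no 'source' key are skipped.
    """
    seen = {}
    for metadata in metadatas:
        source = (metadata or {}).get("source")
        if source is not None:
            seen.setdefault(source, None)  # dicts preserve insertion order
    return list(seen)


# Hypothetical sample shaped like db.get()["metadatas"]
sample_metadatas = [
    {"source": "csv-files/small-csv.csv"},
    {"source": "ppt-content/presentation.pptx"},
    None,
    {"source": "csv-files/small-csv.csv"},
]

sources = unique_sources(sample_metadatas)
print(sources)
```

With the real store this would be called as unique_sources(db.get()["metadatas"]).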

To list all the documents and their content stored in the vector DB:

db.get()

Upvotes: 1

Karthik Sunil

Reputation: 748

Once the DB is created, you can create a client separately, pointing it at the DB's persist directory, as below:

import chromadb
from chromadb.config import Settings

# Point the client at the directory the DB was persisted to
client = chromadb.Client(Settings(is_persistent=True,
                                    persist_directory="<PERSIST_DIR_NAME>",
                                ))
coll = client.get_collection("<name of the collection>")
coll.get() # Gets all the data

You get a dictionary with the IDs, the metadata (including the source) and the documents. Embeddings are included only if you request them through the include argument.

Upvotes: 9

carteakey

Reputation: 364

Looking at the source code (https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py), you can just call:

db.get()

and you will get a dictionary with the IDs, metadata and documents. Pass include=["embeddings", "metadatas", "documents"] to get the embeddings as well.
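
Since the goal in the question is to write everything back to MongoDB, here is a rough sketch of how the column-oriented result of get() could be reshaped into one record per stored item. to_records, the sample data and the field names text, metadata and embedding are my own assumptions, not part of LangChain or Chroma; _id is simply MongoDB's primary key field.

```python
def to_records(result):
    """Turn the dict returned by get() into one dict per stored item."""
    ids = result["ids"]
    n = len(ids)
    # Columns that were not requested may be None; pad them so zip lines up.
    documents = result.get("documents") or [None] * n
    metadatas = result.get("metadatas") or [None] * n
    embeddings = result.get("embeddings")
    if embeddings is None:  # explicit check, since this may be an array
        embeddings = [None] * n

    records = []
    for _id, doc, meta, emb in zip(ids, documents, metadatas, embeddings):
        records.append({
            "_id": _id,
            "text": doc,
            "metadata": meta or {},
            # Convert to plain floats so the value is easy to serialize
            "embedding": None if emb is None else [float(x) for x in emb],
        })
    return records


# Hypothetical sample shaped like the output of
# db.get(include=["embeddings", "metadatas", "documents"])
sample = {
    "ids": ["a", "b"],
    "documents": ["first chunk", "second chunk"],
    "metadatas": [{"source": "x.csv"}, None],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

records = to_records(sample)
print(records[0])
```

With the real store this would be to_records(db.get(include=["embeddings", "metadatas", "documents"])), and the resulting list could then be handed to something like pymongo's insert_many.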

Upvotes: 19
