Reputation: 701
I'm using LangChain to process a large number of documents stored in a MongoDB database.
I can load all the documents into the Chroma vector store using LangChain without problems. Nothing fancy is being done here. This is my code:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()
Now, after storing the data, I want to get a list of all the documents and embeddings WITH their IDs.
This is so I can store them back in MongoDB.
I also want to put them through BERTopic to get the topic categories.
Question 1 is: how do I get all the documents I've just stored in the Chroma database? I want the documents and all the metadata.
Many thanks for your help!
Upvotes: 17
Views: 56450
Reputation: 21
Try this. I usually do it with the chromadb library directly.
import chromadb
from chromadb.utils import embedding_functions
# Create the open-source embedding function
embedding_function1 = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
# Open the persisted database and fetch (or create) the collection
client = chromadb.PersistentClient(path="./chroma_db1")
col = client.get_or_create_collection(name="test1", embedding_function=embedding_function1)
# limit=5 returns only the first five records; drop it to fetch everything
all_data = col.get(
    include=["documents", "metadatas"],
    limit=5
)
all_data
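If the collection is large, fetching it in pages may be gentler on memory than a single call. Below is a minimal sketch, assuming the collection object accepts `limit` and `offset` in `get()` as chromadb collections do; the helper name `iter_batches` and the default batch size are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_batches(collection: Any, batch_size: int = 1000,
                 include: Optional[List[str]] = None) -> Iterator[Dict[str, Any]]:
    """Yield successive results of collection.get(), batch_size records at a time.

    `collection` is assumed to expose get(include=..., limit=..., offset=...).
    """
    include = include or ["documents", "metadatas"]
    offset = 0
    while True:
        batch = collection.get(include=include, limit=batch_size, offset=offset)
        if not batch["ids"]:
            # An empty page means we have walked past the last record.
            break
        yield batch
        offset += len(batch["ids"])
```

With the `col` object from above, `for batch in iter_batches(col, batch_size=500): ...` would walk through the whole collection page by page.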
Upvotes: 0
Reputation: 756
This worked for me. I just needed a list of the file names from the source key in the Chroma DB; I didn't want all the other metadata, just the source files.
## get list of all file URLs in vector db
vectordb = Chroma.from_documents(texts, embeddings, persist_directory="db2")
db = vectordb
data = db.get()  # fetch once instead of querying the store on every access
print(data.keys())
print(len(data["ids"]))
# Print the list of source files
for metadata in data["metadatas"]:
    # print(metadata)
    source = metadata["source"]
    print(source)
Output:
dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])
140
csv-files/small-csv.csv
csv-files/dataset-04-17-2024.csv
ppt-content/presentation.pptx
csv-files/events-small-csv.csv
csv-files/small-csv.csv
To list all the stored documents and their content, call:
db.get()
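Since every chunk of a file carries the same source, the printed list repeats file names. A minimal sketch for collapsing them while keeping first-seen order, assuming the dictionary layout shown in the output above; the helper name `unique_sources` is hypothetical.

```python
from typing import Any, Dict, List


def unique_sources(data: Dict[str, Any], key: str = "source") -> List[str]:
    """Return each distinct metadata[key] once, in first-seen order."""
    seen: Dict[str, None] = {}  # dicts preserve insertion order
    for metadata in data.get("metadatas") or []:
        # Skip records with no metadata or without the requested key.
        if metadata and key in metadata:
            seen.setdefault(metadata[key], None)
    return list(seen)
```

It would be called as `unique_sources(db.get())`.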
Upvotes: 1
Reputation: 748
Once the DB is created, you can create a client separately, pointing it at the DB's persist directory, as below:
import chromadb
from chromadb.config import Settings
client = chromadb.Client(Settings(is_persistent=True,
                                  persist_directory="<PERSIST_DIR_NAME>",
                                  ))
coll = client.get_collection("<name of the collection>")
coll.get()  # Gets all the data
You get a JSON-like dictionary with the stored info: the metadata, source and documents as well.
Upvotes: 9
Reputation: 364
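Looking at the source code (https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py), you can just call:
db.get()
and you will get a dictionary with the IDs, documents and metadata. As far as I can tell, recent versions leave the embeddings out unless you ask for them, e.g. db.get(include=["embeddings", "documents", "metadatas"]).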
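The result of `get()` holds parallel lists (`ids`, `documents`, `metadatas`, and `embeddings` when requested), whereas MongoDB expects one document per record. A minimal sketch of that reshaping, assuming this list layout; the helper name `to_records` and the output field names are hypothetical.

```python
from typing import Any, Dict, List


def to_records(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the parallel lists returned by get() into one dict per record."""
    ids = data["ids"]

    def column(name: str) -> List[Any]:
        # A field that was not requested comes back as None (or is absent).
        values = data.get(name)
        return list(values) if values is not None else [None] * len(ids)

    records = []
    for doc_id, text, meta, emb in zip(ids, column("documents"),
                                       column("metadatas"), column("embeddings")):
        records.append({
            "chroma_id": doc_id,
            "text": text,
            "metadata": meta,
            # Plain floats so the vector is serializable whatever its source type.
            "embedding": [float(x) for x in emb] if emb is not None else None,
        })
    return records
```

The resulting list could then be passed to something like pymongo's `insert_many`, and the `text` and `embedding` fields handed on to BERTopic.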
Upvotes: 19