I am trying to add documents to my ChromaDB (in persistent mode) using collection.add(), and I use the default sentence transformer to encode the abstracts. I followed this thread and did something similar to the suggested solution. I am getting around 8K embeddings/hour, which seems quite slow. I have an A100 GPU and a 40-core CPU, and I have passed the device="cuda" argument to the model as well.
What is the expected throughput on such hardware? Am I doing something wrong?
import multiprocessing as mp
import time

import chromadb
from chromadb.utils import embedding_functions


def consumer(use_cuda, queue):
    # Instantiate chromadb instance. Data is stored on disk (a folder named
    # 'my_vectordb' will be created in the same folder as this file).
    chroma_client = chromadb.PersistentClient(path="my_vectordb")
    device = 'cuda' if use_cuda else 'cpu'
    # Select the embedding model to use.
    # List of model names can be found here https://www.sbert.net/docs/pretrained_models.html
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-mpnet-base-v2", device=device)
    # Get the collection, aka vector database, created by the main process.
    # Specify the model that we want to use to do the embedding.
    collection = chroma_client.get_collection(name="pubmed_0", embedding_function=sentence_transformer_ef)
    while True:
        # Check for items in the queue; this blocks until the queue has items to process.
        batch = queue.get()
        if batch is None:
            break
        # Add to collection
        collection.add(
            documents=batch[0],
            metadatas=batch[1],
            ids=batch[2]
        )


if __name__ == "__main__":
    chroma_client = chromadb.PersistentClient(path="my_vectordb")
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2")
    # For cleaner reloading, delete and recreate the collection.
    try:
        chroma_client.get_collection(name="pubmed_0")
        chroma_client.delete_collection(name="pubmed_0")
    except Exception as err:
        print(err)
    collection = chroma_client.create_collection(name="pubmed_0", embedding_function=sentence_transformer_ef)
    # Create a shared queue.
    queue = mp.Queue()
    # Create producer and consumer processes. `producer` (defined elsewhere) reads the
    # arrow file and puts (documents, metadatas, ids) batches of 32 onto the queue.
    producer_process = mp.Process(target=producer, args=('/home/hasan/hpc_mnt_hasan/pubmed_download_extract/pubmed_dataset_out/train/data-00109-of-00110.arrow', 32, queue,))
    consumer_process = mp.Process(target=consumer, args=(True, queue,))
    # Do not create multiple consumer processes, because ChromaDB is not multiprocess safe.
    start_time = time.time()
    # Start processes.
    producer_process.start()
    consumer_process.start()
    # Wait for the producer to finish producing.
    producer_process.join()
    # Signal the single consumer to stop by putting one None into the queue.
    queue.put(None)
    # Wait for the consumer to finish consuming.
    consumer_process.join()
    print(f"Elapsed seconds: {time.time()-start_time:.0f} Record count: {collection.count()}")
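Since the producer emits batches of only 32 records, one thing I could try is grouping records into larger batches before queueing them, so each collection.add() call encodes more documents at once. A minimal sketch (make_batches and the 256 batch size are my own, not anything from Chroma):

```python
def make_batches(documents, metadatas, ids, batch_size=256):
    """Yield (documents, metadatas, ids) tuples of at most batch_size items each."""
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        yield (documents[start:end], metadatas[start:end], ids[start:end])
```

The producer could then put each yielded tuple on the queue instead of 32-record batches, keeping the consumer code unchanged.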
https://cookbook.chromadb.dev/strategies/batching/#creating-batches
The Chroma cookbook suggests that batching is happening; however, I suspect it might be computing embeddings one by one. Also, nvidia-smi is showing 0 MB of GPU memory used. What could be the issue here?
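To rule Chroma out, I could bypass its embedding function entirely: encode with sentence-transformers directly (where batch size is controllable and GPU use is easy to verify) and pass the precomputed vectors to collection.add(). A sketch, reusing the model name from above; batch_size=256 is just a guess:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
print(model.device)  # should report a cuda device if the GPU is actually in use

def add_precomputed(collection, documents, metadatas, ids):
    # encode() batches internally on the GPU; collection.add() then skips encoding
    # because embeddings are supplied explicitly.
    embeddings = model.encode(documents, batch_size=256, show_progress_bar=False)
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids,
        embeddings=embeddings.tolist(),
    )
```

If encode() alone is fast and nvidia-smi shows memory in use while it runs, the bottleneck would be on the Chroma side rather than the model.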