Reputation: 11

What is the typical data ingestion speed for adding embeddings to ChromaDB?

I am trying to add documents to my ChromaDB (in persistent mode) using collection.add(), and I use the default sentence transformer to encode the abstracts. I followed this thread and do something similar to the suggested solution. I am getting around 8K embeddings/hour, which seems quite slow given that I have an A100 GPU and a 40-core CPU, and I have passed the device="cuda" argument to the model as well.

What is the expected throughput on such hardware? Am I doing something wrong?
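For reference, one way to separate encoder throughput from Chroma's write path would be to time the sentence-transformers model directly, along these lines (a minimal sketch; the sample texts and batch size are placeholders, not my real data):

import time
from sentence_transformers import SentenceTransformer

# Load the same model the Chroma embedding function wraps, pinned to the GPU.
model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

texts = ["placeholder abstract text"] * 2048  # stand-in corpus

start = time.time()
# encode() batches internally; batch_size controls how much work
# goes to the GPU per forward pass.
model.encode(texts, batch_size=256, show_progress_bar=False)
print(f"{len(texts) / (time.time() - start):.0f} embeddings/sec")

My actual ingestion script is below: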

import multiprocessing as mp
import time

import chromadb
from chromadb.utils import embedding_functions

# producer() (defined elsewhere) reads the .arrow file and puts
# (documents, metadatas, ids) batches onto the shared queue.

def consumer(use_cuda, queue):
    # Instantiate chromadb instance. Data is stored on disk (a folder named 'my_vectordb' will be created in the same folder as this file).
    chroma_client = chromadb.PersistentClient(path="my_vectordb")
    device = 'cuda' if use_cuda else 'cpu'
    # Select the embedding model to use.
    # List of model names can be found here https://www.sbert.net/docs/pretrained_models.html
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2", device=device)

    # Get the collection (created in the main process below), specifying the
    # embedding model to use for documents added to it.
    collection = chroma_client.get_collection(name="pubmed_0", embedding_function=sentence_transformer_ef)

    while True:
        # queue.get() blocks until an item is available to process.
        batch = queue.get()
        if batch is None:
            break
        
        # Add to collection
        collection.add(
            documents=batch[0],
            metadatas=batch[1],
            ids=batch[2]
        )

if __name__ == "__main__":

    chroma_client = chromadb.PersistentClient(path="my_vectordb")
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2")

    # For cleaner reloading, delete and recreate collection  
    try:
        chroma_client.get_collection(name="pubmed_0")
        chroma_client.delete_collection(name="pubmed_0")
    except Exception as err:
        print(err)

    collection = chroma_client.create_collection(name="pubmed_0", embedding_function=sentence_transformer_ef)

    # Create a shared queue
    queue = mp.Queue()

    # Create producer and consumer processes.
    producer_process = mp.Process(target=producer, args=('/home/hasan/hpc_mnt_hasan/pubmed_download_extract/pubmed_dataset_out/train/data-00109-of-00110.arrow', 32, queue,))
    consumer_process = mp.Process(target=consumer, args=(True, queue,))
    # Do not create multiple consumer processes, because ChromaDB is not multiprocess safe.

    start_time = time.time()

    # Start processes
    producer_process.start()
    consumer_process.start()

    # Wait for producer to finish producing
    producer_process.join()

    # Signal the consumer to stop by putting None into the queue
    # (one None per consumer process; there is a single consumer here).
    queue.put(None)

    # Wait for consumer to finish consuming
    consumer_process.join()

    print(f"Elapsed seconds: {time.time()-start_time:.0f} Record count: {collection.count()}")

https://cookbook.chromadb.dev/strategies/batching/#creating-batches
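If I read that page correctly, its helper splits oversized inputs before calling add(), roughly like this (a sketch based on the cookbook's example; I have not verified the exact tuple layout it returns):

from chromadb.utils.batch_utils import create_batches

# ids, documents, metadatas here stand in for one producer batch.
# Each resulting batch is an (ids, embeddings, metadatas, documents)
# tuple per the cookbook's example.
batches = create_batches(api=chroma_client, ids=ids,
                         documents=documents, metadatas=metadatas)
for batch in batches:
    collection.add(ids=batch[0], embeddings=batch[1],
                   metadatas=batch[2], documents=batch[3])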

The cookbook suggests that batching like this is happening; however, I suspect the embeddings might be computed one by one, and nvidia-smi shows 0 MiB of GPU memory in use. What could be the issue here?
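To check whether the consumer process can see the GPU at all, I could run a quick probe inside that process (a sketch; if the first line prints False, the model is silently falling back to CPU, which would match the 0 MiB reading):

import torch

# Run inside the consumer process, before loading the model.
print(torch.cuda.is_available())   # False -> silent CPU fallback
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))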

Upvotes: 0

Views: 72

Answers (0)
