Reputation: 11
I'm currently trying to integrate Cohere embeddings into ChromaDB, but I'm facing an issue when adding documents to my Chroma collection. I am using chromadb 0.5.11 and cohere 5.10.0. When I use the Cohere API directly, everything works fine. For example:
import cohere

co = cohere.ClientV2(CO_API_KEY)
response = co.embed(
    texts=["hello", "goodbye"],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"],
)
This code runs successfully and generates embeddings without any issue.
However, when I try to integrate Cohere embeddings into ChromaDB using the following code:
import chromadb
from chromadb.utils import embedding_functions

co_ef = embedding_functions.CohereEmbeddingFunction(
    model_name="embed-english-v3.0",
    api_key=CO_API_KEY,
)
path = "root/data"
client = chromadb.PersistentClient(path=path)
collection = client.get_or_create_collection(
    name="vector_db",
    embedding_function=co_ef,
    metadata={"hnsw:space": "cosine"},
)
collection.add(documents=["hello", "goodbye"], ids=["0", "1"])
I get the following error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It seems like the error occurs when adding the documents with collection.add(). Based on the error message, it looks like Chroma is expecting homogeneous arrays, but something about the Cohere embeddings or how they are processed by Chroma seems to be causing a mismatch.
What I have tried so far: ensuring that Cohere is generating embeddings in the correct format (I assume this is handled by CohereEmbeddingFunction), and changing the documents/inputs in case the issue was related to the specific strings. The error persists in both cases.
Upvotes: 0
Views: 98
Reputation: 11
The issue comes from how the Cohere embeddings are processed when they are passed into ChromaDB. The error I was encountering is due to the format of the embeddings that the Cohere API returns: ChromaDB expects a homogeneous array (every embedding with the same shape), but the embeddings from Cohere may not be in that format out of the box.
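To make the "homogeneous array" requirement concrete, here is a minimal sketch of a sanity check that could be run on the raw embeddings pulled out of a Cohere response before they reach Chroma. The helper name `embedding_dimension` is hypothetical and not part of chromadb or cohere; it only assumes that a valid batch is a non-empty list of equal-length lists of numbers.

```python
from numbers import Real
from typing import Sequence


def embedding_dimension(embeddings: Sequence[Sequence[float]]) -> int:
    """Return the shared vector length, or raise ValueError if the batch is malformed.

    Hypothetical debugging helper; not part of chromadb or cohere.
    """
    if not isinstance(embeddings, (list, tuple)) or len(embeddings) == 0:
        raise ValueError("expected a non-empty list of vectors")
    dims = set()
    for i, vector in enumerate(embeddings):
        if not isinstance(vector, (list, tuple)):
            raise ValueError(f"item {i} is {type(vector).__name__}, not a vector")
        # bool is a subclass of int, so exclude it explicitly.
        if not all(isinstance(x, Real) and not isinstance(x, bool) for x in vector):
            raise ValueError(f"item {i} contains non-numeric values")
        dims.add(len(vector))
    if len(dims) != 1:
        raise ValueError(f"vectors have different lengths: {sorted(dims)}")
    return dims.pop()
```

If the check raises, its message points at the first item that is not a plain numeric vector, which is usually enough to see what shape the response actually has.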
To fix this, I updated the CohereEmbeddingFunction class. Specifically, the problem lies in how the embeddings are extracted from the Cohere response. Here's a modified version of CohereEmbeddingFunction that should work:
import logging

from chromadb.api.types import Documents, EmbeddingFunction, Embeddings

logger = logging.getLogger(__name__)


class CohereEmbeddingFunction(EmbeddingFunction[Documents]):
    def __init__(self, api_key: str, model_name: str = "embed-english-v3.0"):
        try:
            import cohere
        except ImportError:
            raise ValueError(
                "The cohere python package is not installed. Please install it with `pip install cohere`"
            )
        self._client = cohere.Client(api_key)
        self._model_name = model_name

    def __call__(self, input: Documents) -> Embeddings:
        # Send all documents to the Cohere Embed API in a single request.
        response = self._client.embed(
            texts=input,
            model=self._model_name,
            input_type="search_document",
            embedding_types=["float"],
        )
        # Return only the embeddings field of the response. Chroma needs a list of
        # equal-length float vectors here; if your SDK version nests the vectors by
        # type when embedding_types is set, unwrap that level first.
        return response.embeddings
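Depending on the cohere SDK version and on whether embedding_types is passed, response.embeddings may be either a plain list of vectors or an object that groups vectors by type. The following is a rough sketch of a defensive extraction step covering both cases. The helper name `extract_float_embeddings` is hypothetical, and the attribute names `float_` and `float` are assumptions about the by-type response object that should be checked against the installed SDK. The SimpleNamespace objects are stand-ins so that the sketch runs without an API key.

```python
from types import SimpleNamespace
from typing import Any, List


def extract_float_embeddings(response: Any) -> List[List[float]]:
    """Return a plain list of float vectors from an embed response.

    Hypothetical helper. The attribute names tried below ("float_", "float")
    are assumptions about the by-type response shape, not a documented contract.
    """
    embeddings = response.embeddings
    # Case 1: the response already carries a flat list of vectors.
    if isinstance(embeddings, list):
        return [[float(x) for x in vec] for vec in embeddings]
    # Case 2: vectors are grouped by type; look for the float group.
    for attr in ("float_", "float"):
        vectors = getattr(embeddings, attr, None)
        if vectors is not None:
            return [[float(x) for x in vec] for vec in vectors]
    raise TypeError(f"unrecognized embeddings container: {type(embeddings).__name__}")


# Stand-in objects mimicking the two assumed response shapes (no API call).
flat = SimpleNamespace(embeddings=[[0.1, 0.2], [0.3, 0.4]])
nested = SimpleNamespace(embeddings=SimpleNamespace(float_=[[0.1, 0.2], [0.3, 0.4]]))

print(extract_float_embeddings(flat))
print(extract_float_embeddings(nested))
```

In the class above, `__call__` could return `extract_float_embeddings(response)` instead of `response.embeddings` if the raw field turns out not to be a list of lists.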
Upvotes: 0