Yumumu

Reputation: 71

OpenAI embeddings return different vectors for the same text

I am trying out the OpenAI Embeddings API and ran into an issue: when I embed the same text repeatedly, I get different vectors back.

The text is "baby is crying", and the model is text-embedding-ada-002 (MODEL GENERATION: V2). I ran the code in a for loop 5 times and got different vector values. For example, the first element of the vector across the five runs was

"-0.017496677", "-0.017429505", "-0.017429505", "-0.017429505" and "-0.017496677"

I would expect embedding the same text to return the same vector every time. Is that right?

Upvotes: 7

Views: 5161

Answers (2)

Yilmaz

Reputation: 49571

From here

We just faced the same issues for the first time here when using the openai-python package.

We did some tests and around 11% of them were considerably different, even being near in the vector space.

UPDATE: For anyone facing this issue, the embeddings’ endpoint is deterministic. The difference is caused by the OpenAI Python package, as it uses base64 as the default encoding format, while others don’t.

If you dive into the library code:

class Embedding(EngineAPIResource):
    OBJECT_NAME = "embeddings"

    @classmethod
    def create(cls, *args, **kwargs):
        start = time.time()
        timeout = kwargs.pop("timeout", None)

        user_provided_encoding_format = kwargs.get("encoding_format", None)

        # If encoding format was not explicitly specified, we opaquely use base64 for performance
        if not user_provided_encoding_format:
            kwargs["encoding_format"] = "base64"

From this GitHub repo:

Displaying coordinates of text embeddings retrieved using the OpenAI Python library shows more digits than when the embeddings are retrieved explicitly from the API endpoint or using most other libraries. This repository explores why that is, how to get this behavior (and by the same mechanism) when working in other languages, and why one should not usually bother to do so.

More specifically, this repository is a collection of code examples and documentation for the encoding_format argument to the OpenAI embeddings API, which, when set to base64, will send raw floats encoded in Base64. The OpenAI Python library uses that under the hood.
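To make the base64 format concrete, here is a minimal sketch of how such a payload can be decoded without any network call. It assumes the payload is a Base64 string of packed little-endian 32-bit floats; the helper names and the sample values (the first two are the numbers from the question) are illustrative only and are not part of the OpenAI library.

```python
import base64
import struct


def encode_base64_embedding(values):
    """Pack floats as little-endian float32 and Base64-encode them.

    Hypothetical helper, used here only to build a sample payload.
    """
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_base64_embedding(payload):
    """Decode a Base64 string of packed little-endian float32 values.

    Assumption: this mirrors the layout sent when encoding_format="base64".
    """
    raw = base64.b64decode(payload)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))


# Sample values for illustration, not real API output.
sample = [-0.017496677, -0.017429505, 0.25]
payload = encode_base64_embedding(sample)
decoded = decode_base64_embedding(payload)

# repr() of the decoded floats shows more digits than the 8-digit decimals
# above, because each float32 is widened to a Python float (64-bit).
for original, value in zip(sample, decoded):
    print(original, "->", repr(value))
```

The extra digits come from the float32-to-float64 widening, not from additional precision, which is why the two representations are equivalent for practical purposes.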

Upvotes: 3

Hritik Sharma

Reputation: 2020

  • The vectors for the same input (sentence) should be the same, or at least very similar to each other.

  • If they were not, similarity search against a vector database would return inaccurate results.

  • I found this to be a very helpful read: OpenAI discussion

  • Quoting from the discussion forum:

Embeddings only return vectors. The vector is the same for the same input, same model, and the same API endpoint. But we have seen differences between the OpenAI endpoint and the Azure endpoint for the same model. So pick an endpoint and stick with it to avoid any differences.

There could be very slight roundoff errors in the embedding when calling it over and over for the same (above) configuration, but this is in the noise and won’t affect your search result
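As a rough check of the "in the noise" claim, the sketch below compares two vectors that differ only in their first coordinate by the amount reported in the question. Only that first coordinate comes from the question; the remaining coordinates and the 1e-3 tolerance are made up for illustration, and real ada-002 vectors have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First coordinate: the two values seen in the question.
# Other coordinates: hypothetical, identical in both runs.
run_1 = [-0.017496677, 0.0123, -0.0301, 0.0087]
run_2 = [-0.017429505, 0.0123, -0.0301, 0.0087]

similarity = cosine_similarity(run_1, run_2)

# Treat the vectors as equivalent for search if the similarity is within
# an (assumed) tolerance of 1.0.
same_for_search = math.isclose(similarity, 1.0, abs_tol=1e-3)
print(similarity, same_for_search)
```

Comparing with a tolerance rather than with == is the usual way to test whether two embeddings of the same text match.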

Upvotes: 3
