figs_and_nuts

Reputation: 5771

Can't use my own vectors with bring-your-own-vectors in Weaviate. It defaults to the sentence transformer specified in the YAML used to create the local server

The Docker Compose YAML I use to create my local server:

version: '3.4'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    restart: on-failure:0
    ports:
    - 8080:8080
    - 50051:50051
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0

I create a collection:

client.collections.create(name = "legal_sections", 
                          properties = [wvc.config.Property(name = "content",
                                                           description = "The actual section chunk that the answer is to be extracted from",
                                                           data_type = wvc.config.DataType.TEXT,
                                                           index_searchable = True,
                                                           index_filterable = True,
                                                           skip_vectorization = True,
                                                           vectorize_property_name = False)])

I create the data to be uploaded and then I upload it:

upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content':content
        },
        vector = vector
    ))

client.collections.get("Legal_sections").data.insert_many(upserts)

My custom vectors have length 1024:

upserts[0].vector.shape
output:
(1024,)

I get a random uuid:

coll = client.collections.get("legal_sections")

for i in coll.iterator():
    print(i.uuid)
    break
output:
386be699-71de-4bad-9022-31173b9df8d2

I check the length of the vector stored with the object at this uuid:

coll.query.fetch_object_by_id('386be699-71de-4bad-9022-31173b9df8d2', include_vector=True).vector['default'].__len__()
output:
384

This should be 1024. What am I doing wrong?

Upvotes: 0

Views: 339

Answers (2)

Seba Wita

Reputation: 51

There is a nice Weaviate Academy course on BYOV (bring your own vectors) that might come in handy; one of its examples even shows how to do it with HuggingFace.

Here is a Docker Compose configuration that should suit your needs better, in case you want to use HuggingFace vectorization at a later stage (it is not enabled by default):

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.1
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: 'text2vec-huggingface'
      CLUSTER_HOSTNAME: 'node1'
volumes:
  weaviate_data:
...

In your collection configuration, you don't specify a vectorizer, so the collection falls back to the DEFAULT_VECTORIZER_MODULE from your compose file. If you plan to do all vectorization yourself, it is best to set the vectorizer to none, like this (see the example in the docs):

import weaviate.classes as wvc
client.collections.create(
    name = "legal_sections",

    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    
    properties = [wvc.config.Property(
        name = "content",
        description = "The actual section chunk that the answer is to be extracted from",
        data_type = wvc.config.DataType.TEXT,
        index_searchable = True,
        index_filterable = True,
        skip_vectorization = True,
        vectorize_property_name = False)
    ]
)

By the way, you don't need to define properties to bring your own vectors. Also, skip_vectorization has no effect if you always provide your own vectors.

As a bonus, I would recommend configuring the HuggingFace vectorizer. If you provide vectors with your insert calls, Weaviate will use the vectors you provide, which is probably what you need at this point.

However, if later you get more data, you can send it to Weaviate without the vectors, and the vectorizer will use HuggingFace to generate the vector embeddings for you.

You can configure a collection with a vectorizer (see docs), like this:

from weaviate.classes.config import Configure, Property, DataType
client.collections.create(
    name = "legal_sections",

    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="title_vector",
            source_properties=["content"], # the list of properties that should be used for vectorization - if no vector is provided
            model="sentence-transformers/all-MiniLM-L6-v2", # your HF model
        )
    ],

    properties = [
        Property(name = "content", data_type = DataType.TEXT)
    ]
)

I hope this helps.

Upvotes: 0

figs_and_nuts

Reputation: 5771

This is most probably a bug in Weaviate (someone from Weaviate can confirm). Each element of the embedding model's output has dtype np.float32.

This leads to two issues:

  1. collections.data.insert raises an error saying it cannot JSON-serialize float32
  2. collections.data.insert_many silently suppresses that error and instead vectorizes with the model specified in the YAML used to create the server

The above code works just fine if I convert the embeddings using

vector = [float(i) for i in vector]

That is to say:

upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content':content
        },
        vector = vector
    ))

gets converted to

upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content':content
        },
        vector = [float(i) for i in vector]
    ))
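
For NumPy output, calling .tolist() on the vector should give the same result as the list comprehension, since it returns built-in Python floats. A minimal sketch, assuming the encoder returns NumPy arrays: to_float_list is a hypothetical helper name, and the standard-library array type stands in for one embedding row so the snippet runs without NumPy or a Weaviate server.

```python
import json
from array import array


def to_float_list(vector):
    """Return the vector as a plain list of built-in Python floats.

    Hypothetical helper, not part of the Weaviate client.
    """
    # NumPy arrays (and the stdlib array type) expose .tolist(), which
    # converts the elements to native Python numbers in one call.
    if hasattr(vector, "tolist"):
        vector = vector.tolist()
    # float() also covers plain lists or tuples of ints and floats.
    return [float(x) for x in vector]


# Stand-in for one row of embeddings.encode(docs): a 32-bit float buffer
# from the standard library (assumption made so this runs without NumPy).
sample = array("f", [0.5, 0.25, -1.0])

plain = to_float_list(sample)

# A list of built-in floats serializes to JSON without error.
payload = json.dumps({"vector": plain})
```

In the loop above, this would replace the comprehension with vector = to_float_list(vector).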

Code to replicate the issue with np.float32:

As written, the following insert hits the JSON serialization error. It works if the vector is not passed through np.array with an explicit np.float32 dtype.

import weaviate
import numpy as np
client = weaviate.connect_to_local()

jeopardy = client.collections.get("JeopardyQuestion")
uuid = jeopardy.data.insert(
    properties={
        "question": "This vector DB is OSS and supports automatic property type inference on import",
        "answer": "Weaviate",
    },
    vector = list(np.array([0.12345] * 1536, dtype = np.float32))
)

Upvotes: 0
