Reputation: 5771
My local client creation yml
version: '3.4'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    restart: on-failure:0
    ports:
      - 8080:8080
      - 50051:50051
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0
I create a collection:
client.collections.create(
    name="legal_sections",
    properties=[
        wvc.config.Property(
            name="content",
            description="The actual section chunk that the answer is to be extracted from",
            data_type=wvc.config.DataType.TEXT,
            index_searchable=True,
            index_filterable=True,
            skip_vectorization=True,
            vectorize_property_name=False,
        )
    ],
)
I create the data to be uploaded and then I upload it:
upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties={'content': content},
        vector=vector,
    ))
client.collections.get("Legal_sections").data.insert_many(upserts)
My custom vectors are of length 1024:
upserts[0].vector.shape
output:
(1024,)
I get a random uuid:
coll = client.collections.get("legal_sections")
for i in coll.iterator():
    print(i.uuid)
    break
output:
386be699-71de-4bad-9022-31173b9df8d2
I check the length of the vector that the object at this uuid has been stored with:
len(coll.query.fetch_object_by_id(
    '386be699-71de-4bad-9022-31173b9df8d2',
    include_vector=True
).vector['default'])
output:
384
This should be 1024. What am I doing wrong?
Upvotes: 0
Views: 339
Reputation: 51
There is a nice Weaviate Academy course for BYOV (bring your own vectors), which might come in handy, and one of the examples even shows how to do it with HuggingFace.
Here is a docker compose configuration that should better suit your needs, in case you want to use the Hugging Face vectorization at a later stage (it is not enabled by default):
---
version: '3.4'
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.1
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: 'text2vec-huggingface'
      CLUSTER_HOSTNAME: 'node1'
volumes:
  weaviate_data:
...
In your collection configuration, you don't specify a vectorizer. If you plan to do all the vectorization yourself, it is best to set it to none, like this (see the example in the docs):
import weaviate.classes as wvc
client.collections.create(
    name="legal_sections",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    properties=[
        wvc.config.Property(
            name="content",
            description="The actual section chunk that the answer is to be extracted from",
            data_type=wvc.config.DataType.TEXT,
            index_searchable=True,
            index_filterable=True,
            skip_vectorization=True,
            vectorize_property_name=False,
        )
    ],
)
By the way, you don't really need to specify the property definitions to bring in your own vectors. Also, skip_vectorization will not be used if you always provide your own vectors.
As a bonus, I would recommend configuring the Hugging Face vectorizer. If you provide vectors with your insert calls, Weaviate will use the vectors you provide, which is probably what you need at this point.
However, if you add more data later, you can send it to Weaviate without vectors, and the vectorizer will use Hugging Face to generate the vector embeddings for you.
You can configure a collection with a vectorizer (see docs), like this:
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="legal_sections",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="title_vector",
            # properties to vectorize when no vector is provided
            source_properties=["content"],
            model="sentence-transformers/all-MiniLM-L6-v2",  # your HF model
        )
    ],
    properties=[
        Property(name="content", data_type=DataType.TEXT)
    ],
)
I hope this helps.
Upvotes: 0
Reputation: 5771
This is most probably a bug with Weaviate (someone from Weaviate can confirm). The embedding model's output has elements of dtype np.float32.
This leads to 2 issues:
1. collections.data.insert raises an error that it cannot JSON-serialize float32
2. collections.data.insert_many silently suppresses this error and simply encodes using the model given in the yml used to create the client
The above code works just fine if I convert the embeddings using
vector = [float(i) for i in vector]
That is to say:
upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties={'content': content},
        vector=vector,
    ))
gets converted to
upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties={'content': content},
        vector=[float(i) for i in vector],
    ))
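The serialization issue can be reproduced without Weaviate at all: Python's json module cannot serialize np.float32 elements, while native Python floats work fine. A minimal standalone sketch:

```python
import json

import numpy as np

vec = np.array([0.5] * 4, dtype=np.float32)

# A plain list() keeps the np.float32 elements, which json cannot handle:
try:
    json.dumps(list(vec))
except TypeError as e:
    print("serialization failed:", e)

# Converting each element to a native float works:
print(json.dumps([float(x) for x in vec]))  # → [0.5, 0.5, 0.5, 0.5]
```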
Code to replicate the issue with np.float32: the following code works only if you don't pass the vector through np.array with the np.float32 dtype explicitly specified.
import weaviate
import numpy as np

client = weaviate.connect_to_local()
jeopardy = client.collections.get("JeopardyQuestion")
uuid = jeopardy.data.insert(
    properties={
        "question": "This vector DB is OSS and supports automatic property type inference on import",
        "answer": "Weaviate",
    },
    vector=list(np.array([0.12345] * 1536, dtype=np.float32)),
)
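As an aside, ndarray.tolist() converts every element to a native Python type in one call, so it may be a tidier workaround than the per-element float() comprehension:

```python
import numpy as np

vec = np.array([0.12345] * 4, dtype=np.float32)

# list() keeps the np.float32 elements; ndarray.tolist() yields native floats.
print(type(list(vec)[0]))     # numpy.float32
print(type(vec.tolist()[0]))  # float
```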
Upvotes: 0