Reputation: 2042
I am trying to do semantic search with Elasticsearch using tensorflow_hub, but I get RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')
. From the search_phase_execution_exception I suppose the index contains corrupted data (as suggested in this stack question). My document structure looks like this:
{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 1
},
"mappings": {
"dynamic": "true",
"_source": {
"enabled": "true"
},
"properties": {
"id": {
"type":"keyword"
},
"title": {
"type": "text"
},
"abstract": {
"type": "text"
},
"abs_emb": {
"type":"dense_vector",
"dims":512
},
"timestamp": {
"type":"date"
}
}
}
}
And I create the index using es.indices.create, then index the documents:
es.indices.create(index=index, body='my_document_structure')
res = es.indices.delete(index=index, ignore=[404])
for i in range(100):
    doc = {
        'timestamp': datetime.datetime.utcnow(),
        'id': id[i],
        'title': title[0][i],
        'abstract': abstract[0][i],
        'abs_emb': tf_hub_KerasLayer([abstract[0][i]])[0]
    }
    res = es.index(index=index, body=doc)
For my semantic search I use this code:
query = "graphene"
query_vector = list(embed([query])[0])
script_query = {
"script_score": {
"query": {"match_all": {}},
"script": {
"source": "cosineSimilarity(params.query_vector, doc['abs_emb']) + 1.0",
"params": {"query_vector": query_vector}
}
}
}
response = es.search(
index=index,
body={
"size": 5,
"query": script_query,
"_source": {"includes": ["title", "abstract"]}
}
)
I know there are some similar questions on Stack Overflow and in the Elasticsearch forums, but I couldn't find a solution that works for me. My guess is that the document structure is wrong, but I can't figure out what exactly. I used the search query code from this repo. The full error message is too long and doesn't seem to contain much information, so I'm sharing only the last part of it.
~/untitled/elastic/venv/lib/python3.9/site-packages/elasticsearch/connection/base.py in
_raise_error(self, status_code, raw_data)
320 logger.warning("Undecodable raw error response from server: %s", err)
321
--> 322 raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
323 status_code, error_message, additional_info
324 )
RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')
And here is the error from the Elasticsearch server.
[2021-04-29T12:43:07,797][WARN ][o.e.c.r.a.DiskThresholdMonitor]
[asmac.local] high disk watermark [90%] exceeded on
[w7lUacguTZWH9xc_lyd0kg][asmac.local][/Users/username/elasticsearch-
7.12.0/data/nodes/0] free: 17.2gb[7.4%], shards will be relocated
away from this node; currently relocating away shards totalling [0]
bytes; the node is expected to continue to exceed the high disk
watermark when these relocations are complete
Upvotes: 5
Views: 14972
Reputation: 3034
I had a similar issue, because I was using doc['text_vector'] instead of 'text_vector' in the script (see the breaking change in Elasticsearch 7.6).
Once I added json.dumps to the error output, I found that the 'text_vector' field was not a dense_vector, because of this error message:
class org.elasticsearch.index.fielddata.ScriptDocValues$Doubles cannot be cast to class org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues
And to fix this error, I had to create the index with the mappings field set to:
{ "properties": { "text_vector": { "type": "dense_vector", "dims": 3 } } }
Here dims is the size of the vector (the number of elements in the vector).
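As a quick sanity check (a minimal sketch with the example values from this answer), the dims declared in the mapping must equal the length of every vector you index; Elasticsearch rejects documents whose vector length differs:

```python
# "dims" as declared in the index mapping, and an example vector to index.
# These values mirror the toy example above; with a real embedding model,
# dims would be the model's output size (e.g. 512 for the Universal Sentence Encoder).
dims = 3
text_embedding = [4.2, 3.4, -0.2]

# Worth asserting before sending the document, since a mismatch
# only surfaces later as an indexing error from Elasticsearch.
assert len(text_embedding) == dims
```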
A function to create index with mapping of types for fields:
def create():
    index = 'text_index'
    body = {
        "settings": {},
        "mappings": { "properties": { "text_vector": { "type": "dense_vector", "dims": 3 } } }
    }
    es.indices.create(index=index, body=body)
    click.echo(f"Index {index} is created with settings {json.dumps(body, indent=4)}")
A function to index any string of text:
def index(input_str):
    # text_embedding = embed([input_str])[0].numpy().tolist()
    text_embedding = [4.2, 3.4, -0.2]
    body = {'text': input_str, 'text_vector': text_embedding}
    res = es.index(index='text_index', body=body)
    click.echo(f"Indexed {input_str} with id {res['_id']}")
A function to execute vector search using elasticsearch for any text string:
def search(search_string):
    # search_vector = embed([search_string])[0].numpy().tolist()
    search_vector = [4.2, 3.4, -0.2]
    body = {
        'query': {
            'script_score': {
                'query': {'match_all': {}},
                'script': {
                    'source': "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    'params': {'query_vector': search_vector}
                }
            }
        }
    }
    try:
        res = es.search(index='text_index', body=body)
        click.echo("Search results:")
        for doc in res['hits']['hits']:
            click.echo(f"{doc['_id']} {doc['_score']}: {doc['_source']['text']}")
    except Exception as inst:
        print(type(inst))
        print(json.dumps(inst.args, indent=4))
Note: this is just an example; adjust the mappings and the embedding vectors used to index and search text according to your embedding model configuration. If it still does not help, read the error carefully in the JSON dump.
Full description of issue: https://github.com/Konard/elastic-search/issues/3
Full source code: https://github.com/Konard/elastic-search/commit/1df0748dd8e8a37c29e1d128eedf96d074e5a73f
Upvotes: 1
Reputation: 131
For me the issue was that I was using elastiknn_dense_float_vector instead of dense_vector; script_score queries on elastiknn vectors are still an open issue. I am converting my vector index to use dense_vector instead:
https://github.com/alexklibisz/elastiknn/issues/323
Upvotes: 0
Reputation: 11
In my case the error was "Caused by: java.lang.ClassCastException: class org.elasticsearch.index.fielddata.ScriptDocValues$Doubles cannot be cast to class org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues".
My mistake was that I removed the ES index, the one that had the "type":"dense_vector" field, before starting to ingest content.
As a result, ES did not use the correct type for indexing the dense vectors: they were stored as useless lists of doubles. In this sense the ES index was 'corrupted': all 'script_score' queries returned 400.
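One way to catch this before querying is to check the field mapping after (re)creating the index. The sketch below walks the shape of a mapping response; the index and field names are hypothetical, and in practice the dict would come from es.indices.get_mapping(index="my_index"):

```python
# Hypothetical mapping response, shaped like what es.indices.get_mapping returns.
# If the index was recreated without an explicit mapping, dynamic mapping kicks in
# and the vector field ends up as a plain float/double field instead.
mapping = {
    "my_index": {
        "mappings": {
            "properties": {
                "abs_emb": {"type": "dense_vector", "dims": 512}
            }
        }
    }
}

field = mapping["my_index"]["mappings"]["properties"]["abs_emb"]
# If this fails, documents were indexed before the mapping existed,
# and script_score queries on the field will return 400.
assert field["type"] == "dense_vector"
```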
Upvotes: 1
Reputation: 217464
I think you're hitting the following issue and you should update your query to this:
script_query = {
"script_score": {
"query": {"match_all": {}},
"script": {
"source": "cosineSimilarity(params.query_vector, 'abs_emb') + 1.0",
"params": {"query_vector": query_vector}
}
}
}
Also make sure that query_vector contains floats and not doubles.
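A minimal sketch of that conversion, assuming the embedding call returns an array-like of numeric values (fake_embedding below stands in for the output of embed([query])[0]):

```python
import json

# Stand-in for the embedding output; real embeddings are often numpy
# float64 values, which is why an explicit conversion is useful.
fake_embedding = (0.25, -0.5, 0.125)

# Convert to a plain list of Python floats so the vector serializes
# cleanly into the JSON request body for the script params.
query_vector = [float(v) for v in fake_embedding]

payload = json.dumps({"query_vector": query_vector})
```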
Upvotes: 3