Elasticsearch recently added vector-based queries. Each document stores a vector as a field, and a query vector is scored against those stored vectors to find the closest matches in the corpus.
You can find more information in this link, where the Elasticsearch team explains how this should work and even provides an example query:
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'])",
"params": {
"queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
}
}
}
}
}
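For intuition about what this script scores, here is a minimal pure-Python sketch of cosine similarity between two sparse vectors stored as {dimension: value} dictionaries. It only illustrates the metric and is not Elasticsearch's implementation; the function name and the zero-vector convention are assumptions made for this example.

```python
import math


def cosine_similarity_sparse(query_vector, doc_vector):
    """Cosine similarity between two sparse vectors given as {dimension: value} dicts."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(
        value * doc_vector[dim]
        for dim, value in query_vector.items()
        if dim in doc_vector
    )
    query_norm = math.sqrt(sum(v * v for v in query_vector.values()))
    doc_norm = math.sqrt(sum(v * v for v in doc_vector.values()))
    if query_norm == 0.0 or doc_norm == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (query_norm * doc_norm)


query = {"2": 0.5, "10": 111.3}
doc = {"2": 1.0, "7": 2.0}
print(cosine_similarity_sparse(query, doc))
```

The result always lies between -1 and 1, and vectors that share no dimensions score 0.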
I have installed the latest Elasticsearch version. In particular, curl -XGET 'http://localhost:9200' gives me this info:
"version" : {
"number" : "7.3.0",
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "de777fa",
"build_date" : "2019-07-24T18:30:11.767338Z",
"build_snapshot" : false,
"lucene_version" : "8.1.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
}
I am working with the Python library elasticsearch (elasticsearch_dsl as well, but not yet for these queries). I can set up my Elasticsearch index, load documents and make queries. For example, this works:
query_body = {
"query": {
"query_string": {
"query": "Some text",
"default_field": "some_field"
}
}
}
es.search(index=my_index, body=query_body)
However, when I try the same code for a query almost identical to the official example, it does not work.
My query:
query_body = {
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "cosineSimilaritySparse(params.queryVector, doc['my_embedding_field_name'])",
"params": {
"queryVector": {"1703": 0.0261, "1698": 0.0261, "2283": 0.0459, "2263": 0.0523, "3741": 0.0349}
}
}
}
}
}
Note that the sparse vector in the query is an example I made up, making sure that its keys appear in the embedding vector of at least one of my documents (I am not sure whether this matters, but I did it just in case).
The error:
elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')
The error message does not help me much, and since this is a really new feature, I could not find other help online.
Update: Below is a more complete error message, produced when using curl for the query.
The core of the error is:
"type" : "illegal_argument_exception",
"reason" : "Variable [embedding] is not defined."
The complete message is:
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"... (params.queryVector, doc[embedding])",
" ^---- HERE"
],
"script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
"lang" : "painless"
},
{
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"... (params.queryVector, doc[embedding])",
" ^---- HERE"
],
"script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
"lang" : "painless"
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "test-index",
"node" : "216BQPYoQ-SIzcrV1jzMOQ",
"reason" : {
"type" : "query_shard_exception",
"reason" : "script_score: the script could not be loaded",
"index_uuid" : "e1kpygbHRai9UL8_0Lbsdw",
"index" : "test-index",
"caused_by" : {
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"... (params.queryVector, doc[embedding])",
" ^---- HERE"
],
"script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
"lang" : "painless",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Variable [embedding] is not defined."
}
}
}
},
{
"shard" : 0,
"index" : "tutorial",
"node" : "216BQPYoQ-SIzcrV1jzMOQ",
"reason" : {
"type" : "query_shard_exception",
"reason" : "script_score: the script could not be loaded",
"index_uuid" : "n2FNFgAFRiyB_efJKfsGPA",
"index" : "tutorial",
"caused_by" : {
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"... (params.queryVector, doc[embedding])",
" ^---- HERE"
],
"script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
"lang" : "painless",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Variable [embedding] is not defined."
}
}
}
}
],
"caused_by" : {
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"... (params.queryVector, doc[embedding])",
" ^---- HERE"
],
"script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
"lang" : "painless",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Variable [embedding] is not defined."
}
} }, "status" : 400}
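Since the useful part of such a response is buried under several levels of "caused_by", a small helper can pull it out. The following is a sketch only: the helper name is made up for this example and the dictionary is a trimmed copy of the response above. With the Python client, the same structure should, as far as I know, be available as the info attribute of the raised RequestError, but treat that as an assumption to verify against your client version.

```python
def innermost_cause(error):
    """Follow nested "caused_by" entries and return the deepest one."""
    cause = error
    while isinstance(cause.get("caused_by"), dict):
        cause = cause["caused_by"]
    return cause


# Trimmed copy of the error response shown above.
response = {
    "error": {
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "caused_by": {
            "type": "script_exception",
            "reason": "compile error",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "Variable [embedding] is not defined.",
            },
        },
    },
    "status": 400,
}

cause = innermost_cause(response["error"])
print(cause["type"], "-", cause["reason"])
```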
Update 2: My documents have this structure:
{"name": "doc_name", "field_1": "doc_id", "field_2": "a_keyword", "text": "a rather long text", "embedding": {"4655": 0.040158602078116556, "4640": 0.040158602078116556}}
Update 3: I am passing a mapping after creating the index, with:
"properties": {
"name": {
"type": "keyword"
},
"field_1": {
"type": "keyword"
},
"field_2": {
"type": "keyword"
},
"text": {
"type": "text"
},
"embedding": {
"type": "sparse_vector"
}
}
This removed an error complaining about too many fields (each key in the embedding was being treated as a separate field), but the query error is the same.
Reputation: 1481
To solve this problem, we need to make sure that Elasticsearch understands that the vector field ("embedding" in my case) is actually a sparse vector. For this, use the following mapping:
"properties": {
"name": {
"type": "keyword"
},
"reference": {
"type": "keyword"
},
"jurisdiction": {
"type": "keyword"
},
"text": {
"type": "text"
},
"embedding": {
"type": "sparse_vector"
}
}
More details in this related question.
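With the Python client, one way to apply such a mapping is to pass it when the index is created, before any documents are loaded. This is a minimal sketch assuming Elasticsearch 7.3 and the elasticsearch Python package as used in the question; the helper names are hypothetical and the field list is shortened for illustration.

```python
def build_index_body(vector_field="embedding"):
    """Index settings body that declares the vector field as a sparse_vector."""
    return {
        "mappings": {
            "properties": {
                "name": {"type": "keyword"},
                "text": {"type": "text"},
                # Without this explicit type, every key of the vector is
                # mapped as a separate field.
                vector_field: {"type": "sparse_vector"},
            }
        }
    }


def create_index(es, index_name):
    """Create the index with the mapping above.

    `es` is assumed to be an elasticsearch.Elasticsearch client (7.x).
    """
    return es.indices.create(index=index_name, body=build_index_body())


print(build_index_body()["mappings"]["properties"]["embedding"])
```

Calling create_index(es, "test-index") before indexing documents should leave the embedding field typed as sparse_vector.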
There are two important things to note:
It is recommended to add 1.0 to the metric. Cosine similarity ranges from -1 to 1, so the offset keeps the score from ever being negative.
"source": "cosineSimilaritySparse(params.queryVector, doc['my_embedding_field_name']) + 1.0"
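Putting this together, the query body from the question can be built with the + 1.0 offset as sketched below. The helper name is made up for this example, and the field name and vector values are taken from the sample document in the question.

```python
def build_sparse_cosine_query(field, query_vector):
    """script_score query using cosineSimilaritySparse, shifted by 1.0 to stay non-negative."""
    source = "cosineSimilaritySparse(params.queryVector, doc['{}']) + 1.0".format(field)
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": source,
                    "params": {"queryVector": query_vector},
                },
            }
        }
    }


query_body = build_sparse_cosine_query(
    "embedding",
    {"4655": 0.040158602078116556, "4640": 0.040158602078116556},
)
print(query_body["query"]["script_score"]["script"]["source"])
# prints: cosineSimilaritySparse(params.queryVector, doc['embedding']) + 1.0
```

The resulting dictionary can then be passed to the client in the same way as the working query in the question, for example es.search(index=my_index, body=query_body).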
Credit for these last points goes to jimczi from the Elastic Team (thanks!). See the question on the forums here.