Leo

Reputation: 5132

Dense vector array and cosine similarity

I would like to store an array of dense_vector values in a single document, but this does not work the way it does for other data types, e.g.:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "dense_vector",
        "dims": 3  
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [[0.5, 10, 6], [-0.5, 10, 10]]
}

returns:

'1 document(s) failed to index.',
    {'_index': 'my_index', '_type': '_doc', '_id': 'some_id', 'status': 400, 'error': 
      {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': 
        {'type': 'parsing_exception', 
         'reason': 'Failed to parse object: expecting token of type [VALUE_NUMBER] but found [START_ARRAY]'
        }
      }
    }

How do I achieve this? Different documents will have a variable number of vectors but never more than a handful.

Also, I would then like to query it by performing a cosineSimilarity for each value in that array. The code below is how I normally do it when I have only one vector in the doc.

"script_score": {
    "query": {
        "match_all": {}
    },
    "script": {
        "source": "(1.0+cosineSimilarity(params.query_vector, doc['my_vectors']))",
        "params": {"query_vector": query_vector}
    }
}

Ideally, I would like the closest similarity or an average.

Upvotes: 7

Views: 12377

Answers (3)

randomsolutions

Reputation: 2273

I got to this post by attempting to have a set of vectors in my document.

When I do this:

"mappings": {
    "properties": {
        "vectors": {
            "type": "nested",
            "properties": {
                "vector": {
                    "type": "dense_vector",
                    "dims": 768,
                    "index": "true",
                    "similarity": "cosine"
                }
            }   
        },
        "my_text" : {
            "type" : "keyword"
        }
    }
}

I get:

BadRequestError: BadRequestError(400, 'illegal_argument_exception', "[dense_vector] fields cannot be indexed if they're within [nested] mappings")

If I remove "index": "true" and "similarity": "cosine", the problem goes away (but then I can't use kNN search, which is my main goal).

Hopefully this helps someone.

Upvotes: 0

Glen Smith

Reputation: 146

The dense_vector datatype expects one array of numeric values per document like so:

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [0.5, 10, 6]
}

To store any number of vectors, you could make the my_vectors field a "nested" type that contains an array of objects, where each object holds one vector:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 3  
          }
        }
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [
    {"vector": [0.5, 10, 6]}, 
    {"vector": [-0.5, 10, 10]}
  ]
}
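
If the vectors start out as a plain list of lists on the client side, the reshaping into nested objects can be sketched in Python as follows. This is only an illustration: the helper name build_nested_doc is made up, and it assumes the field names my_vectors and vector and dims of 3 from the mapping above.

```python
from typing import Dict, Sequence


def build_nested_doc(text: str, vectors: Sequence[Sequence[float]], dims: int = 3) -> Dict:
    """Wrap each vector in its own object so it fits the nested mapping.

    Hypothetical helper: the field names "my_vectors" and "vector" are taken
    from the mapping in this answer and must match your own index.
    """
    for v in vectors:
        # Every vector has to match the dims declared in the mapping.
        if len(v) != dims:
            raise ValueError(f"expected {dims} dimensions, got {len(v)}")
    return {
        "my_text": text,
        "my_vectors": [{"vector": list(v)} for v in vectors],
    }


doc = build_nested_doc("text1", [[0.5, 10, 6], [-0.5, 10, 10]])
print(doc)
```

The resulting dict can be passed as the document body to whichever client you use for indexing.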

EDIT

Then, to query the documents, you can use the following (as of ES v7.6.1)

{
  "query": {
    "nested": {
      "path": "my_vectors",
      "score_mode": "max", 
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
              "params": {"query_vector": query_vector}
            }
          }
        }
      }
    }
  }
}

A few things to note:

  • The query needs to be wrapped in a nested query, because the vectors are stored as nested objects.
  • Nested objects are separate Lucene documents, so each one is scored individually. By default, the parent document is assigned the average score of its matching nested documents. You can set the nested query's score_mode to change this. For your case, "max" scores each parent by its largest cosine similarity, i.e. by its most similar vector.
  • To see the score of each nested vector, use the nested query's inner_hits option.
  • The 1.0 is added because cosine similarity takes values in [-1, 1], but Elasticsearch does not allow negative scores. The scores are therefore shifted to [0, 2].
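
To make the scoring concrete, here is a rough pure-Python sketch of what the parent score works out to for "max" and "avg". It only mimics the arithmetic described above and does not call Elasticsearch; the function names are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parent_score(query_vector: Sequence[float],
                 vectors: Sequence[Sequence[float]],
                 score_mode: str = "max") -> float:
    """Combine per-vector scores the way score_mode "max" or "avg" is described above."""
    # Each nested vector gets 1.0 + cosine similarity, i.e. a value in [0, 2].
    scores = [1.0 + cosine_similarity(query_vector, v) for v in vectors]
    if score_mode == "max":
        return max(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unsupported score_mode: {score_mode}")


query_vector = [0.5, 10, 6]
stored = [[0.5, 10, 6], [-0.5, 10, 10]]
print(parent_score(query_vector, stored, "max"))
print(parent_score(query_vector, stored, "avg"))
```

With "max", a document ranks highly as soon as one of its vectors is close to the query; with "avg", a single close vector is diluted by the others.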

Upvotes: 13

Pierre Mallet

Reputation: 7221

According to the documentation, the dense_vector datatype

stores dense vectors of float values ... A dense_vector field is a single-valued field.

In your example, you want to index multiple vectors in the same property. But as the documentation says, the field must be single-valued. If a document has multiple vectors, they need to be split across different properties.

No workaround :(

So you need to put each vector in its own field, then loop over those fields in your script and keep the best score.
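
A rough sketch of that idea in Python, generating the mapping and an unrolled Painless script for a fixed maximum number of vectors. The field names my_vector_0, my_vector_1, ... and both helper names are made up for illustration, and the generated script assumes the string-argument form of cosineSimilarity available from ES 7.6 onward. The script text has not been run against a cluster, so treat it as a starting point only.

```python
from typing import Dict


def build_multi_field_mapping(max_vectors: int, dims: int) -> Dict:
    """One dense_vector field per slot (hypothetical names my_vector_0, my_vector_1, ...)."""
    properties = {
        f"my_vector_{i}": {"type": "dense_vector", "dims": dims}
        for i in range(max_vectors)
    }
    properties["my_text"] = {"type": "keyword"}
    return {"mappings": {"properties": properties}}


def build_max_similarity_script(max_vectors: int) -> str:
    """Unrolled Painless source that keeps the best 1.0 + cosineSimilarity score.

    Assumption: documents may leave some slots empty, so each field is
    checked with doc[...].size() before it is scored.
    """
    lines = ["double best = 0.0;"]
    for i in range(max_vectors):
        field = f"my_vector_{i}"
        lines.append(
            f"if (doc['{field}'].size() != 0) {{ "
            f"best = Math.max(best, 1.0 + cosineSimilarity(params.query_vector, '{field}')); }}"
        )
    lines.append("return best;")
    return "\n".join(lines)


mapping = build_multi_field_mapping(max_vectors=3, dims=3)
script_source = build_max_similarity_script(max_vectors=3)
print(script_source)
```

The generated source would go in the "source" of a script_score query, with query_vector supplied through params as in the question.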

Upvotes: 0
