Corentin Limier
OpenSearch: insufficient number of hits for nested k-NN queries with efficient filter

What is the bug?

We use an index to store text documents for semantic search. Because the text is long, we split it into paragraphs and embed each chunk with the all-MiniLM-L6-v2 model. Each chunk embedding is stored in a nested field of the document. Each document also has an account ID attribute (accountId in the mapping below) that we use as an efficient filter when querying.

We then run approximate k-NN queries using the Lucene HNSW engine.

Based on the documentation, I expect a k-NN query on this nested field with an efficient filter to return at least n hits, where n is the minimum of k and the number of documents that match the filter.

But for some specific input vectors or query_text values, we get fewer than n hits, and sometimes even 0. For the same filter with a different query, we get the correct n hits.

We have two other indices without a nested field (only one vector per document) that use the same efficient filter, and they work as expected.

This seems similar to https://github.com/opensearch-project/k-NN/issues/2222 or https://github.com/opensearch-project/k-NN/issues/2339, except that the efficient filter here is just a simple term filter.

How can one reproduce the bug?

The error happens only on specific queries, so it's hard to reproduce.

Here is the mapping of the index:

{
  "knowledge-index": {
    "mappings": {
      "properties": {
        "accountId": {
          "type": "keyword"
        },
        "id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "text"
        },
        "metadataEmbedding": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Here is the query:

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "from": 0,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "metadataEmbedding",
      "query": {
        "neural": {
          "metadataEmbedding.knn": {
            "query_text": "<query_text>",
            "model_id": "9QxR8YsBSCN1wquQEH2b",
            "k": <k>,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}

For k = 38, I get 6 hits:

  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.50342417,

But for k = 1000 I get 32 hits, and for k = 10000 (the maximum value) I get 232 hits.

For another query_text value, the results are different: the number of hits always equals k (or 232, the total number of documents that match the filter).
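
To make the shortfall explicit, the hit counts reported above can be compared with the expected n = min(k, matching documents). This is only a minimal Python sketch over the numbers quoted in this post (232 matching documents). The names `expected_hits` and `shortfalls` are illustrative and not part of any OpenSearch API.

```python
# Hit counts reported above for the problematic query_text, keyed by k.
observed_hits = {38: 6, 1000: 32, 10000: 232}

# Number of documents matching the accountId term filter (from this post).
MATCHING_DOCS = 232


def expected_hits(k: int, matching_docs: int) -> int:
    """Expected number of hits: the smaller of k and the filtered document count."""
    return min(k, matching_docs)


def shortfalls(observed: dict, matching_docs: int) -> dict:
    """Return, for each k, how many hits are missing versus the expectation."""
    return {
        k: expected_hits(k, matching_docs) - hits
        for k, hits in observed.items()
    }


if __name__ == "__main__":
    for k, missing in shortfalls(observed_hits, MATCHING_DOCS).items():
        print(f"k={k}: expected {expected_hits(k, MATCHING_DOCS)}, missing {missing}")
```

With these numbers, only k = 10000 reaches the expected count. The two smaller values of k both fall short.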

I get the same results when I first convert the text to a vector and pass the vector directly, without the neural query:

POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs":[ "<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
                 ....
             ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
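
Since the symptom depends on k, it may help to generate the same request body for several values of k, together with a count of the documents that match the filter alone. The Python sketch below only builds the JSON bodies and sends nothing. `nested_knn_body` and `count_body` are hypothetical helper names, and the zero vector in the usage example is a placeholder for the real 384-dimensional embedding.

```python
import json


def nested_knn_body(vector, k, account_id, size=5):
    """Build the nested k-NN search body shown above for a given k."""
    return {
        "size": size,
        "_source": {"excludes": ["metadataEmbedding"]},
        "query": {
            "nested": {
                "path": "metadataEmbedding",
                "query": {
                    "knn": {
                        "metadataEmbedding.knn": {
                            "vector": list(vector),
                            "k": k,
                            # Efficient filter: applied during the k-NN search.
                            "filter": {"term": {"accountId": account_id}},
                        }
                    }
                },
            }
        },
    }


def count_body(account_id):
    """Body for GET /knowledge-index/_count: documents matching the filter alone."""
    return {"query": {"term": {"accountId": account_id}}}


if __name__ == "__main__":
    placeholder_vector = [0.0] * 384  # stand-in for the real embedding
    for k in (38, 1000, 10000):
        body = nested_knn_body(placeholder_vector, k, "<account_id>")
        print(k, len(json.dumps(body)))
    print(json.dumps(count_body("<account_id>")))
```

Comparing hits.total.value for each k against the _count result gives the same kind of table as the figures quoted above.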

What is the expected behavior?

Getting n hits, where n is the minimum of k and the number of documents that match the efficient filter.
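
To pin down what I mean by this expectation, here is a brute-force reference in Python. It is not how OpenSearch or Lucene implement the search, only a sketch of the semantics I expect: filter on the parent document, keep the best chunk per document, and return the top k documents. It assumes that ranking by the smallest squared L2 distance matches taking the maximum chunk score. `Doc`, `squared_l2` and `exact_nested_knn` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    account_id: str
    chunks: list  # one embedding (list of floats) per nested chunk


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_nested_knn(docs, query, k, account_id):
    """Return up to k document ids, nearest first, among filtered documents."""
    candidates = []
    for doc in docs:
        # Filter on the parent document before looking at any vector.
        if doc.account_id != account_id or not doc.chunks:
            continue
        # Keep only the closest chunk of each document.
        best = min(squared_l2(chunk, query) for chunk in doc.chunks)
        candidates.append((best, doc.doc_id))
    candidates.sort()
    return [doc_id for _, doc_id in candidates[:k]]


if __name__ == "__main__":
    docs = [
        Doc("a", "acc1", [[0.0, 0.0], [1.0, 1.0]]),
        Doc("b", "acc1", [[5.0, 5.0]]),
        Doc("c", "acc2", [[0.0, 0.0]]),
    ]
    print(exact_nested_knn(docs, [0.0, 0.0], k=10, account_id="acc1"))
```

In this reference, the number of results is always min(k, number of filtered documents that have at least one chunk), whatever the query vector is.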

What is your host/environment?

Do you have any additional context?

Here is the result of GET /_plugins/_knn/stats?pretty on the node:

{
      "max_distance_query_with_filter_requests": 0,
      "graph_memory_usage_percentage": 0,
      "graph_query_requests": 0,
      "graph_memory_usage": 0,
      "cache_capacity_reached": false,
      "load_success_count": 0,
      "training_memory_usage": 0,
      "indices_in_cache": {},
      "script_query_errors": 0,
      "hit_count": 0,
      "knn_query_requests": 2215,
      "total_load_time": 0,
      "miss_count": 0,
      "min_score_query_requests": 0,
      "knn_query_with_filter_requests": 2215,
      "training_memory_usage_percentage": 0,
      "max_distance_query_requests": 0,
      "lucene_initialized": true,
      "graph_index_requests": 0,
      "faiss_initialized": false,
      "load_exception_count": 0,
      "training_errors": 0,
      "min_score_query_with_filter_requests": 0,
      "eviction_count": 0,
      "nmslib_initialized": false,
      "script_compilations": 1,
      "script_query_requests": 2,
      "graph_stats": {
        "refresh": {
          "total_time_in_millis": 0,
          "total": 0
        },
        "merge": {
          "current": 0,
          "total": 0,
          "total_time_in_millis": 0,
          "current_docs": 0,
          "total_docs": 0,
          "total_size_in_bytes": 0,
          "current_size_in_bytes": 0
        }
      },
      "graph_query_errors": 0,
      "indexing_from_model_degraded": false,
      "graph_index_errors": 0,
      "training_requests": 0,
      "script_compilation_errors": 0
    },

Any idea what could be causing this? Am I right to expect k hits for nested fields with an efficient filter?

Thanks for your help.
