ArangoSearch filter records before searching

Question

Performance issue with ArangoSearch

I have a document collection with documents like:

{
  "passage": "Some long text",
  "meta": {
    "language": "en",
    "Region":"Asia Pacific"
  },
  "document_name": "my document.pdf"
}

Now, to enable full-text search, I created a view with the following link configuration:

"links": {
    "my_coll": {
      "analyzers": [
        "myAnalyzer"
      ],
      "fields": {
        "passage": {"analyzers": [
        "myAnalyzer"
      ]}
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  }

Now I want to search the passage field, but only for a particular language and region.

My query looks like this:

LET token = tokens("My text to be search", "myAnalyzer")
for docs in my_vw
    search analyzer(token any == docs.passage, "myAnalyzer")
    filter docs.meta.language=="en"
    filter docs.meta.Region=="Global"
    sort BM25(docs) desc
    limit 50
return {passage: docs.passage, score: BM25(docs)}

This query takes around 4 seconds to answer. There are 3,227,261 documents in the collection.

Execution plan:

 Id   NodeType               Est.   Comment
  1   SingletonNode             1   * ROOT
  3   EnumerateViewNode   3227261     - FOR docs IN my_vw SEARCH ANALYZER(([ "my", "token" ] any == docs.`passage`), "myAnalyzer") LET #10 = BM25(docs)   /* view query */
  4   CalculationNode     3227261       - LET #2 = ((docs.`meta`.`language` == "en") && (docs.`meta`.`Region` == "Global"))   /* simple expression */
  5   FilterNode          3227261       - FILTER #2
  9   SortNode            3227261       - SORT #10 DESC   /* sorting strategy: constrained heap */
 10   LimitNode                50       - LIMIT 0, 50
 11   CalculationNode          50       - LET #8 = { "passage" : docs.`passage`, "score" : #10 }   /* simple expression */
 12   ReturnNode               50       - RETURN #8

It selects all the documents first and then applies the filters. Is there any way to apply the filters first and then search?

Can you help improve this query's performance?

Andrey Abramov · Accepted Answer

I suggest avoiding post-filtering. It is better to index the meta.language and meta.Region fields as well, using an adjusted link definition:

"links": {
    "my_coll": {
      "analyzers": [
        "myAnalyzer"
      ],
      "fields": {
        "passage": {"analyzers": [ "myAnalyzer" ]},
        "meta" : { "fields" : { "language":{}, "Region":{} } }
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
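One caveat, stated as an assumption rather than something confirmed above: a field without its own "analyzers" entry inherits the list from the level above it, so meta.language and meta.Region would be indexed with "myAnalyzer" in this definition, while comparisons written outside ANALYZER() use the built-in "identity" analyzer by default. If the exact-match conditions on the meta fields return nothing, a variant that indexes those two fields with "identity" explicitly may be worth trying. This is a sketch of that variant, not the definition from the answer:

```json
"links": {
    "my_coll": {
      "analyzers": [
        "myAnalyzer"
      ],
      "fields": {
        "passage": {"analyzers": [ "myAnalyzer" ]},
        "meta" : {
          "fields" : {
            "language": {"analyzers": [ "identity" ]},
            "Region": {"analyzers": [ "identity" ]}
          }
        }
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  }
```

With "identity", the values are indexed verbatim, so the comparison stays case-sensitive ("Global" does not match "global").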

Then you can transform your query to:

LET token = tokens("My text to be search", "myAnalyzer")
for docs in my_vw
    search analyzer(token any == docs.passage, "myAnalyzer")
           AND docs.meta.language=="en"
           AND docs.meta.Region=="Global"
    sort BM25(docs) desc
    limit 50
return {passage: docs.passage, score: BM25(docs)}
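If the same search has to run for other languages or regions, the conditions can be passed as bind parameters instead of literals. The following is a minimal sketch of the query above; the parameter names @text, @language and @region are hypothetical and must be supplied by the client when the query is executed:

```aql
// @text, @language and @region are illustrative bind parameter names
LET token = TOKENS(@text, "myAnalyzer")
FOR docs IN my_vw
    // all three conditions are evaluated by the view, not by a later FILTER
    SEARCH ANALYZER(token ANY == docs.passage, "myAnalyzer")
           AND docs.meta.language == @language
           AND docs.meta.Region == @region
    SORT BM25(docs) DESC
    LIMIT 50
    RETURN { passage: docs.passage, score: BM25(docs) }
```

Explaining the rewritten query is a quick way to check the result: the conditions should now appear inside the EnumerateViewNode rather than in a separate FilterNode.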
