Nitin

Reputation: 2911

ArangoSearch filter records before searching

Performance issue with arangosearch

I have a document collection with documents like:

{
  "passage": "Some long text",
  "meta": {
    "language": "en",
    "Region":"Asia Pacific"
  },
  "document_name": "my document.pdf"
}

Now, to enable full-text search I created a view and link configuration like:

"links": {
    "my_coll": {
      "analyzers": [
        "myAnalyzer"
      ],
      "fields": {
        "passage": {"analyzers": [ "myAnalyzer" ]}
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  }

Now I want to search the passage field, but only for a particular language and region.

My query looks like:

LET token = tokens("My text to be search", "myAnalyzer")
for docs in my_vw
    search analyzer(token any == docs.passage, "myAnalyzer")
    filter docs.meta.language=="en"
    filter docs.meta.Region=="Global"
    sort BM25(docs) desc
    limit 50
return {passage: docs.passage, score: BM25(docs)}

This query takes around 4 seconds to answer. There are 3,227,261 documents in the collection.

Execution plan:

 Id   NodeType               Est.   Comment
  1   SingletonNode             1   * ROOT
  3   EnumerateViewNode   3227261     - FOR docs IN my_vw SEARCH ANALYZER(([ "my", "token" ] any == docs.`passage`), "myAnalyzer") LET #10 = BM25(docs)   /* view query */
  4   CalculationNode     3227261       - LET #2 = ((docs.`meta`.`language` == "en") && (docs.`meta`.`Region` == "myAnalyzer"))   /* simple expression */
  5   FilterNode          3227261       - FILTER #2
  9   SortNode            3227261       - SORT #10 DESC   /* sorting strategy: constrained heap */
 10   LimitNode                50       - LIMIT 0, 50
 11   CalculationNode          50       - LET #8 = { "passage" : docs.`passage`, "score" : #10 }   /* simple expression */
 12   ReturnNode               50       - RETURN #8

According to the plan, the view first returns every document that matches the search, and the filters are only applied afterwards. Is there any way to apply the filter first and then search?

Can you help improve this query's performance?

Upvotes: 0

Views: 191

Answers (1)

Andrey Abramov

Reputation: 141

I suggest you avoid post-filtering. It is better to index the meta.language and meta.Region fields in the view as well, using an adjusted link definition:

"links": {
    "my_coll": {
      "analyzers": [
        "myAnalyzer"
      ],
      "fields": {
        "passage": {"analyzers": [ "myAnalyzer" ]},
        "meta" : { "fields" : { "language": {"analyzers": [ "identity" ]}, "Region": {"analyzers": [ "identity" ]} } }
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  }

Then you can transform your query to:

LET token = tokens("My text to be search", "myAnalyzer")
for docs in my_vw
    search analyzer(token any == docs.passage, "myAnalyzer")
           AND docs.meta.language=="en"
           AND docs.meta.Region=="Global"
    sort BM25(docs) desc
    limit 50
return {passage: docs.passage, score: BM25(docs)}
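
One way to make the analyzer context explicit is sketched below. It is the same query, written with bind parameters and with the two equality conditions wrapped in ANALYZER(..., "identity"). This assumes that meta.language and meta.Region are indexed with the built-in identity analyzer in the view link. The bind parameter names @text, @language and @region are placeholders chosen for this example, not part of the original query.

```aql
// Sketch only: assumes meta.language and meta.Region are indexed
// with the "identity" analyzer in the link definition of my_vw.
// @text, @language and @region are hypothetical bind parameters.
LET token = TOKENS(@text, "myAnalyzer")
FOR docs IN my_vw
    // Full-text part uses the custom analyzer, exact matches use identity
    SEARCH ANALYZER(token ANY == docs.passage, "myAnalyzer")
           AND ANALYZER(docs.meta.language == @language
                        AND docs.meta.Region == @region, "identity")
    SORT BM25(docs) DESC
    LIMIT 50
    RETURN { passage: docs.passage, score: BM25(docs) }
```

Conditions in SEARCH that are not wrapped in ANALYZER() are evaluated with the identity analyzer by default, so the wrapper here only spells out that requirement. With all three conditions inside SEARCH, the execution plan should no longer contain the separate FilterNode shown in the question.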

Upvotes: 1
