doublemax

Reputation: 473

Characters to split the user-query in Vespa engine

We split the user query on ASCII spaces to build a weakAnd(...).

The user input "Watch【Docudrama】" contains no whitespace, yet the query fails with an error.

Question: Which code points besides whitespace should be used to split the query?

YQL (fails):

select * from post where text contains "Watch【Docudrama】" limit 1;

YQL (works):

select * from post where weakAnd(text contains "Watch",text contains "【Docudrama】") limit 1;

Error message:

{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 0
    },
    "errors": [
      {
        "code": 4,
        "summary": "Invalid query parameter",
        "source": "content",
        "message": "Can not add WORD_ALTERNATIVES text:[ Watch【Docudrama】(1.0) watch(0.7) ] to a segment phrase"
      }
    ]
  }
}

Upvotes: 1

Views: 193

Answers (1)

andreer

Reputation: 366

Are you sure you need to use WAND for this? Try setting the user query grammar to "any" (the default is "all"), which combines the user-supplied terms with the "OR" operator. There is an example here: https://docs.vespa.ai/documentation/reference/query-language-reference.html#userinput
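As a rough sketch of that suggestion, the request parameters could be assembled as below, so the raw text is passed to Vespa unsplit and tokenized there. The helper name build_search_params and the parameter name userinput are made up for illustration, the schema (post) and field (text) are taken from the question, and the annotation syntax follows recent Vespa documentation; older releases wrote it as ([{"grammar": "any"}]userInput(@userinput)), so check it against your version.

```python
from urllib.parse import urlencode


def build_search_params(user_text: str, field: str = "text", hits: int = 1) -> dict:
    """Build query parameters that hand the raw user text to Vespa's userInput().

    Hypothetical helper: the text is not split client-side; Vespa tokenizes it.
    """
    yql = (
        "select * from post where "
        # grammar "any" ORs the parsed terms; defaultIndex picks the field to search
        f'{{grammar: "any", defaultIndex: "{field}"}}userInput(@userinput) '
        f"limit {hits}"
    )
    # "userinput" is an arbitrary parameter name referenced as @userinput in the YQL
    return {"yql": yql, "userinput": user_text}


params = build_search_params("Watch【Docudrama】")
# URL-encoded form, ready to append to the search endpoint of your deployment
query_string = urlencode(params)
```

Passing the text through a separate parameter rather than pasting it into the YQL string also avoids having to escape quotes in the user input.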

The process of splitting up the query is known as tokenization. It is a complex, language-dependent process. Vespa uses Apache OpenNLP to do this (and more): https://docs.vespa.ai/documentation/linguistics.html has more information, along with references to the code that performs this operation.

If you really want to use WAND, then instead of reimplementing the query parsing logic outside Vespa, I suggest you create a Java searcher which descends the query tree and replaces each created AndItem with a WeakAndItem. See https://docs.vespa.ai/documentation/searcher-development.html and the code example here: https://docs.vespa.ai/documentation/advanced-ranking.html

Upvotes: 4
