FindingTheOne

Reputation: 189

Choosing the right tokenizer in Elasticsearch 5.4 to emulate contains-like queries

I am using Elasticsearch 5.4 to implement suggestion/completion-like functionality and am having trouble choosing the right tokenizer for my requirements. Below is an example:

There are 5 documents in the index, with the following content:

DOC 1: Applause

DOC 2: Apple

DOC 3: It is an Apple

DOC 4: Applications

DOC 5: There is_an_appl

Queries

Query 1: Query String 'App' should return all 5 documents.

Query 2: Query String 'Apple' should return only document 2 and document 3.

Query 3: Query String 'Applications' should return only document 4.

Query 4: Query String 'appl' should return all 5 documents.

Tokenizer

I am using the following tokenizer, and I am seeing all documents returned for Query 2 and Query 3.

The analyzer is applied to fields of type 'text'.

"settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
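
To see which tokens this configuration actually produces, the _analyze API can be run against the index (a minimal sketch; my_index is an assumed index name that carries the settings above):

GET my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Applause"
}

With min_gram and max_gram both set to 3, 'Applause' yields the trigrams App, ppl, pla, lau, aus and use.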

How can I restrict the results to only those documents that contain an exact match of the query string, whether as part of an existing word, as part of a phrase, or as a whole word (the expected results are listed with the queries above)?

Upvotes: 1

Views: 118

Answers (1)

Val

Reputation: 217274

That's because you're using an ngram tokenizer instead of an edge_ngram one. The latter only indexes prefixes, while the former indexes prefixes, suffixes, and also sub-parts of your data.

Change your analyzer definition to this instead and it should work as expected:

"settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "edge_ngram",          <---- change this
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
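
The change can be verified with the same _analyze call as before (again a sketch; my_index is an assumed index name recreated with the updated settings):

GET my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Applause"
}

With the edge_ngram type and min_gram/max_gram both at 3, 'Applause' now produces only the single prefix token App instead of every interior trigram.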

Upvotes: 1
