daisy


Ngram filter works differently than I thought

I created an index with the following settings and mapping:

curl -XPUT http://ubuntu:9200/ngram-test -d '{
    "settings": {
        "analysis": {
            "filter": {
                "mynGram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": [ "letter", "digit" ]
                }
            },
            "analyzer": {
                "domain_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "mynGram"]
                }
            }
        }
    },
    "mappings": {
        "assets": {
            "properties": {
                "domain": {
                    "type": "string",
                    "analyzer": "domain_analyzer"
                },
                "tag": {
                    "include_in_parent": true,
                    "type": "nested",
                    "properties": {
                        "name": {
                            "type": "string",
                            "analyzer": "domain_analyzer"
                        }
                    }
                }
            }
        }
    }
}'; echo

Then I added a document:

curl http://ubuntu:9200/ngram-test/assets/ -d '{
  "domain": "www.example.com",
  "tag": [
    {
      "name": "IIS"
    },
    {
      "name": "Microsoft ASP.NET"
    }
  ]
}'; echo

But when I run a query through the validate API,

http://ubuntu:9200/ngram-test/_validate/query?q=tag.name:asp.net&explain

the query is rewritten to this:

filtered(tag.name:a tag.name:as tag.name:asp tag.name:asp. tag.name:asp.n tag.name:asp.ne tag.name:asp.net tag.name:s tag.name:sp tag.name:sp. tag.name:sp.n tag.name:sp.ne tag.name:sp.net tag.name:p tag.name:p. tag.name:p.n tag.name:p.ne tag.name:p.net tag.name:. tag.name:.n tag.name:.ne tag.name:.net tag.name:n tag.name:ne tag.name:net tag.name:e tag.name:et tag.name:t)->cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter@ad04e78f)

This is totally unexpected. I was expecting queries like asp.net*, *asp.net or *asp.net*, not things like tag.name:a.

It means that when I query for asp.net, terms like alex will appear in the search results as well, which is clearly wrong.

Did I miss something?

EDIT

I increased min_gram to 5 and added a search_analyzer:

        "tag": {
            "include_in_parent": true,
            "type": "nested",
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "domain_analyzer",
                    "search_analyzer": "standard"
                }
            }
        }

But the validate output is still unexpected:

# http://ubuntu:9200/tag-test/assets/_validate/query?explain&q=tag.name:microso
filtered(tag.name:micro tag.name:micros tag.name:microso tag.name:icros tag.name:icroso tag.name:croso)->cache(_type:assets)

Hmm... it still searches for icros, icroso and croso.

Upvotes: 0

Views: 125

Answers (1)

Val


An nGram token filter splits each token at the character level, into every substring between min_gram and max_gram characters long. If all you need is to split on words, your whitespace tokenizer already does that.

The elyzer tool shows what happens at each step of the analysis process. With your analyzer, it yields this:

> elyzer --es localhost:9200 --index ngram --analyzer domain_analyzer --text "Microsoft ASP.NET"

TOKENIZER: whitespace
{1:Microsoft}   {2:ASP.NET} 
TOKEN_FILTER: lowercase
{1:microsoft}   {2:asp.net} 
TOKEN_FILTER: mynGram
{1:m,mi,mic,micr,micro,micros,microso,microsof,microsoft,i,ic,icr,icro,icros,icroso,icrosof,icrosoft,c,cr,cro,cros,croso,crosof,crosoft,r,ro,ros,roso,rosof,rosoft,o,os,oso,osof,osoft,s,so,sof,soft,o,of,oft,f,ft,t}   {2:a,as,asp,asp.,asp.n,asp.ne,asp.net,s,sp,sp.,sp.n,sp.ne,sp.net,p,p.,p.n,p.ne,p.net,.,.n,.ne,.net,n,ne,net,e,et,t}
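
To make the mynGram step above concrete, here is a minimal Python sketch that mimics what the filter does to a single token. It is not Elasticsearch code; the function name ngrams is made up for illustration, and it only assumes the behaviour visible in the output above (every substring of min_gram to max_gram characters, ordered by start position and then by length).

```python
def ngrams(token, min_gram=1, max_gram=10):
    """Return every substring of `token` whose length is in [min_gram, max_gram].

    Hypothetical helper that imitates the token filter's output order:
    grouped by start offset, shortest gram first.
    """
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(token):
                break  # no longer grams fit from this start offset
            grams.append(token[start:end])
    return grams


if __name__ == "__main__":
    # Same terms as the first validate output in the question
    print(ngrams("asp.net"))
    # Same terms as the second validate output (min_gram raised to 5)
    print(ngrams("microso", min_gram=5))
    # Grams shared by two unrelated words, which is why "alex" can match "asp.net"
    print(sorted(set(ngrams("alex")) & set(ngrams("asp.net"))))
```

Running it on "asp.net" gives the same 28 terms as the first validate output in the question, and "microso" with min_gram=5 gives the six terms from the second one, which suggests that the nGram filter was still being applied to the query text in both cases. The last line shows that "alex" and "asp.net" share the single-character grams a and e, so with min_gram set to 1 they match each other.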

What you seem to want, though, is something more like this:

TOKENIZER: whitespace
{1:Microsoft}   {2:ASP.NET} 
TOKEN_FILTER: lowercase
{1:microsoft}   {2:asp.net} 

And that can be achieved by removing the mynGram token filter from your analyzer.
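
As a rough sketch of that change, the snippet below builds the analysis settings from the question with only mynGram dropped from the filter chain and prints them as JSON. The variable name analysis_settings is made up for illustration; the analyzer name, tokenizer and remaining filter are copied from the question. How you send the body to your cluster is up to you.

```python
import json

# Hypothetical variable name. The analyzer definition is the one from the
# question, except that "mynGram" is no longer in the filter list.
analysis_settings = {
    "analysis": {
        "analyzer": {
            "domain_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            }
        }
    }
}

if __name__ == "__main__":
    # Print a JSON body that could be used as the "settings" part
    # of the index creation request.
    print(json.dumps({"settings": analysis_settings}, indent=4))
```

With this analyzer a query for asp.net is analyzed into the single term asp.net, so it only matches documents that contain that whole token.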

Upvotes: 1
