Reputation: 23571
I created an index with the following mapping,
curl -XPUT http://ubuntu:9200/ngram-test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "mynGram"]
        }
      }
    }
  },
  "mappings": {
    "assets": {
      "properties": {
        "domain": {
          "type": "string",
          "analyzer": "domain_analyzer"
        },
        "tag": {
          "include_in_parent": true,
          "type": "nested",
          "properties": {
            "name": {
              "type": "string",
              "analyzer": "domain_analyzer"
            }
          }
        }
      }
    }
  }
}'; echo
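(For reference, the tokens this analyzer emits can be inspected with the _analyze API; the call below uses the ES 1.x/2.x query-string form, while newer versions expect a JSON body with analyzer and text fields instead.)
# inspect what domain_analyzer emits for a sample value
curl 'http://ubuntu:9200/ngram-test/_analyze?analyzer=domain_analyzer&pretty' -d 'Microsoft ASP.NET'; echo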
Then I added some documents,
curl http://ubuntu:9200/ngram-test/assets/ -d '{
  "domain": "www.example.com",
  "tag": [
    {
      "name": "IIS"
    },
    {
      "name": "Microsoft ASP.NET"
    }
  ]
}'; echo
But when I validate the query,
http://ubuntu:9200/ngram-test/_validate/query?q=tag.name:asp.net&explain
it turns out the query is expanded to this:
filtered(tag.name:a tag.name:as tag.name:asp tag.name:asp. tag.name:asp.n tag.name:asp.ne tag.name:asp.net tag.name:s tag.name:sp tag.name:sp. tag.name:sp.n tag.name:sp.ne tag.name:sp.net tag.name:p tag.name:p. tag.name:p.n tag.name:p.ne tag.name:p.net tag.name:. tag.name:.n tag.name:.ne tag.name:.net tag.name:n tag.name:ne tag.name:net tag.name:e tag.name:et tag.name:t)->cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter@ad04e78f)
Totally unexpected. I was expecting wildcard-like queries such as asp.net*, *asp.net, or *asp.net*, not things like tag.name:a. That means when I query for asp.net, things like alex will show up in the search results as well, which is totally wrong.
Did I miss something?
I increased min_gram to 5 and added a search_analyzer:
"tag": {
"include_in_parent": true,
"type": "nested",
"properties": {
"name": {
"type": "string",
"analyzer": "domain_analyzer",
"search_analyzer": "standard"
}
}
}
But the validate output is still unexpected:
# http://ubuntu:9200/tag-test/assets/_validate/query?explain&q=tag.name:microso
filtered(tag.name:micro tag.name:micros tag.name:microso tag.name:icros tag.name:icroso tag.name:croso)->cache(_type:assets)
Hmm ... it still contains searches for icros, icroso, and croso.
Upvotes: 0
Views: 125
Reputation: 217514
An nGram token filter will split your tokens at the character level. If all you need is to split on words, your whitespace tokenizer already does the job.
The elyzer tool gives you insight into each step of the analysis process. With your analyzer, it yields this:
> elyzer --es localhost:9200 --index ngram --analyzer domain_analyzer --text "Microsoft ASP.NET"
TOKENIZER: whitespace
{1:Microsoft} {2:ASP.NET}
TOKEN_FILTER: lowercase
{1:microsoft} {2:asp.net}
TOKEN_FILTER: mynGram
{1:m,mi,mic,micr,micro,micros,microso,microsof,microsoft,i,ic,icr,icro,icros,icroso,icrosof,icrosoft,c,cr,cro,cros,croso,crosof,crosoft,r,ro,ros,roso,rosof,rosoft,o,os,oso,osof,osoft,s,so,sof,soft,o,of,oft,f,ft,t} {2:a,as,asp,asp.,asp.n,asp.ne,asp.net,s,sp,sp.,sp.n,sp.ne,sp.net,p,p.,p.n,p.ne,p.net,.,.n,.ne,.net,n,ne,net,e,et,t}
However, what you seem to want is more like this:
TOKENIZER: whitespace
{1:Microsoft} {2:ASP.NET}
TOKEN_FILTER: lowercase
{1:microsoft} {2:asp.net}
And that can be achieved by removing the mynGram token filter from your analyzer.
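For reference, the analysis settings would then reduce to something like this (an untested sketch; the rest of the settings and mappings stay as in your question):
"analysis": {
  "analyzer": {
    "domain_analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["lowercase"]
    }
  }
}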
Upvotes: 1