ss123

Reputation: 13

elasticsearch 1.6 field norm calculation with shingle filter

I am trying to understand the fieldNorm calculation in Elasticsearch (1.6) for documents indexed with a shingle analyzer - it does not seem to include the shingled terms. If that is the case, is it possible to configure the calculation to include them? Specifically, this is the analyzer I used:

{
  "index" : {
    "analysis" : {
        "filter" : {
            "shingle_filter" : {
                "type" : "shingle",
                "max_shingle_size" : 3
            }
        },
        "analyzer" : {
            "my_analyzer" : {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["word_delimiter", "lowercase", "shingle_filter"]
            }
        }  
    }
 }

}

This is the mapping used:

{
    "docs": {
        "properties": {
            "text" : {"type": "string", "analyzer" : "my_analyzer"}
        }
    }
}

And I posted a few documents:

{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...

When using the following query with the explain API,

{
    "query": {
        "match": {
            "text" : "the"
        }
    }
}

I get the following fieldnorms (other details omitted for brevity):

"_source": {
    "text": "the quick"
},
"_explanation": {
    "value": 0.625,
    "description": "fieldNorm(doc=0)"
}

"_source": {
    "text": "the quick brown fox jumps over the"
},
"_explanation": {
    "value": 0.375,
    "description": "fieldNorm(doc=0)"
}

The values suggest that ES counts 2 terms for the first document ("the quick") and 7 terms for the second document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to include the shingled terms (i.e. all terms returned by the analyzer) in the field norm calculation?
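As a sanity check on those numbers: a fieldNorm of the form 1/sqrt(numTerms) does decode to exactly 0.625 for 2 terms and 0.375 for 7 terms once it passes through Lucene's single-byte norm encoding. Below is a minimal Python sketch of that encoding (modeled on Lucene's SmallFloat.floatToByte315/byte315ToFloat, a 3-bit mantissa and 5-bit exponent); the helper names are my own, not an ES API.

```python
import math
import struct

def float_to_byte315(f):
    # Encode a float into Lucene's single-byte norm format
    # (3-bit mantissa, 5-bit exponent, zero-exponent point 15).
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> 21
    if smallfloat <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1
    if smallfloat >= ((63 - 15) << 3) + 0x100:
        return 255
    return smallfloat - ((63 - 15) << 3)

def byte315_to_float(b):
    # Decode the single-byte norm back into a float.
    if b == 0:
        return 0.0
    bits = ((b & 0xff) << 21) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>i', bits))[0]

def field_norm(num_terms):
    # lengthNorm = 1/sqrt(numTerms), stored with single-byte precision.
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))

print(field_norm(2))  # 0.625 -> matches "the quick"
print(field_norm(7))  # 0.375 -> matches "the quick brown fox jumps over the"
```

So the observed explain values are consistent with a term count that excludes the shingles (2 and 7 single-word tokens respectively).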

Upvotes: 1

Views: 130

Answers (1)

keety

Reputation: 17461

You would need to customize the default similarity by disabling the discount_overlaps flag.

Example:

{
  "index" : {
      "similarity" : {
          "no_overlap" : {
            "type" : "default",
            "discount_overlaps" : false
          } 
    },
    "analysis" : {
        "filter" : {
            "shingle_filter" : {
                "type" : "shingle",
                "max_shingle_size" : 3
            }
        },
        "analyzer" : {
            "my_analyzer" : {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["word_delimiter", "lowercase", "shingle_filter"]
            }
        }  
    }
 }
}

Mapping:

{
    "docs": {
        "properties": {
            "text" : {"type": "string", "analyzer" : "my_analyzer", "similarity" : "no_overlap"}
        }
    }
}

To expand further:

By default, overlaps, i.e. tokens with a position increment of 0, are ignored when computing the norm.

The example below shows the positions of the tokens generated by the "my_analyzer" described in the question:

GET <index_name>/_analyze?field=text&text=the quick

{
   "tokens": [
      {
         "token": "the",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "the quick",
         "start_offset": 0,
         "end_offset": 9,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "quick",
         "start_offset": 4,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

According to the Lucene documentation, the length norm calculation for the default similarity is implemented as:

state.getBoost()*lengthNorm(numTerms)

where numTerms is

if setDiscountOverlaps(boolean) is false:
    FieldInvertState.getLength()
else:
    FieldInvertState.getLength() - FieldInvertState.getNumOverlap()
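Tying this back to the _analyze output above: for "the quick" the analyzer emits 3 tokens, but the shingle "the quick" shares position 1 with "the", so it counts as one overlap. A small sketch of the arithmetic (counting distinct positions is my own shorthand for Lucene's position-increment bookkeeping; it gives the same overlap count for this token stream):

```python
import math

# (token, position) pairs from the _analyze output for "the quick"
tokens = [("the", 1), ("the quick", 1), ("quick", 2)]

length = len(tokens)                                    # FieldInvertState.getLength()     -> 3
num_overlap = length - len({pos for _, pos in tokens})  # FieldInvertState.getNumOverlap() -> 1

# Default similarity (discount_overlaps = true): shingles are discounted
num_terms_default = length - num_overlap  # 2
# With "discount_overlaps" : false, every token counts
num_terms_no_discount = length            # 3

print(1.0 / math.sqrt(num_terms_default))      # ~0.707, stored as 0.625 after byte encoding
print(1.0 / math.sqrt(num_terms_no_discount))  # ~0.577, stored as 0.5 after byte encoding
```

With the no_overlap similarity the shingled document is therefore normalized over all 3 tokens rather than 2.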

Upvotes: 1
