Richard Rast

Reputation: 2026

ElasticSearch fieldNorm is always 1

I have recently begun working with Elasticsearch, so I apologize if this is a "basic" question. I've also been migrating our materials from ES version 1.3 to 2.4 (!), so some things have broken in the process, and queries etc. that used to work no longer do (or give "bad" results). I've fixed some of these problems, but this one is a stumper.

I've read the docs about how relevance scoring is done. My index is processed with a pattern tokenizer (which just splits into words), then run through a lowercase filter and an ngram filter (min length 1, max length 3).

Now if I search for the letter "a" then I should get relatively shorter documents first, right? So for example "asian" (which contains two instances of the desired token) should score higher than "Astasia-abasia" (which has six) because proportionally more of its tokens are equal to "a". The proportionality is accounted for by the term frequency and the field norm. Great! This is what I want. But ...

In fact "asian" does not even appear in the first 5000 hits! When I look at ?explain I see that while fieldNorm is present, but always equal to 1. Why is this? How can I fix it?

The index settings I'm using are here:

{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "tokenizer": "pattern_tokenizer",
                    "filter": [ "lowercase", "ngram_filter" ]
                }
            },
            "tokenizer": {
                "pattern_tokenizer": {
                    "type": "pattern",
                    "pattern": "[\\]\\[{}()/ ,:;\"&]+"
                }
            },
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": "1",
                    "max_gram": "3"
                }
            }
        }
    },
    "mappings": {
        "terms": {
            "properties": {
                "code": {
                    "analyzer": "ngram_analyzer",
                    "search_analyzer": "keyword",
                    "type": "string",
                    "norms": {
                        "enabled": true,
                        "loading": "eager"
                    }
                },
                "codeAbbr": {
                    "analyzer": "ngram_analyzer",
                    "search_analyzer": "keyword",
                    "type": "string",
                    "norms": {
                        "enabled": true,
                        "loading": "eager"
                    }
                },
                "term": {
                    "analyzer": "ngram_analyzer",
                    "search_analyzer": "keyword",
                    "type": "string",
                    "norms": {
                        "enabled": true,
                        "loading": "eager"
                    }
                }
            }
        }
    }
}

I don't feel like I should even have to specify the norms attribute (I feel like the above should be the default), but it doesn't matter: whether I take it out or put it in, the result is the same. How can I make fieldNorm work properly?

Upvotes: 3

Views: 198

Answers (1)

Richard Rast

Reputation: 2026

The answer turned out to be somewhat different from what I expected; I hope it saves someone else the time I spent. I did not see this anywhere in the docs I've read, but discovered it through experimentation. My very specific problem can be solved by using an ngram tokenizer rather than an ngram filter (there's a sketch of the changed settings at the end of this answer), but let me explain why this is.

The issue is when fieldNorm is computed, and this is one of the ways ngram filters and ngram tokenizers differ.

fieldNorm is based on the number of tokens in the document, using the formula given in the docs, 1/sqrt(#tokens); depending on who you ask there may or may not be a +1 under the square root, but that doesn't really matter for this question. The important thing is that the #tokens figure is counted after tokenization but before the token filters run.
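To make that concrete with the examples from the question (using the plain 1/sqrt(#tokens) form and ignoring the lossy one-byte encoding Lucene applies to norms; note that the hyphen is not in my split pattern, and the ngram tokenizer's default token_chars keeps it as a character):

pattern tokenizer + ngram filter (#tokens counted before the filter runs):
    "asian"          -> 1 token             -> 1/sqrt(1)  = 1.0
    "Astasia-abasia" -> 1 token             -> 1/sqrt(1)  = 1.0

ngram tokenizer (min 1, max 3):
    "asian"          -> 5+4+3    = 12 grams -> 1/sqrt(12) ≈ 0.29
    "Astasia-abasia" -> 14+13+12 = 39 grams -> 1/sqrt(39) ≈ 0.16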

As far as I know, this is only important with ngram and edge ngram filters, since they are the only ones that change the number of tokens in the document; perhaps this is why it's not prominently explained in the docs. But here are a couple of use cases to explain why this matters:

  1. Suppose your documents consist of long phrases - descriptions, maybe? - and you tokenize with a standard tokenizer or similar. Then your fieldNorm is based essentially on the number of words. This might be what you want; it depends on your use case. This way the search favors shorter documents in terms of the number of words (and using long words doesn't penalize you). If you use an ngram tokenizer instead, the fieldNorm is driven by the number of characters; so if you use lots of little words and I use fewer but bigger words, our scores might be the same. That's usually not what you want.

  2. Now suppose your documents consist of single words or very short phrases (like mine). If you tokenize with a standard tokenizer, most of the documents will have fieldNorm 1, since they're single words. However, I want my search to prioritize shorter words (as an approximation for "common words"), so this doesn't help. Instead I'll use an ngram tokenizer, so longer words get pushed to the bottom, and shorter words float to the top.
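For completeness, here is a minimal sketch of the change that fixed it for me: move the ngram from a filter to a tokenizer (the analyzer and tokenizer names are just illustrative). Note that this drops the pattern-based word splitting from my original settings, which is fine for single-word documents like mine but may not be for yours.

{
    "analysis": {
        "analyzer": {
            "ngram_analyzer": {
                "tokenizer": "ngram_tokenizer",
                "filter": [ "lowercase" ]
            }
        },
        "tokenizer": {
            "ngram_tokenizer": {
                "type": "ngram",
                "min_gram": "1",
                "max_gram": "3"
            }
        }
    }
}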

Upvotes: 3
