Sameer

Reputation: 3183

Elasticsearch: How is docFreq calculated

I am trying to understand how docFreq is calculated. Is it computed per index, per mapping, or per field?

I get the results below from my query when explain is set to true. When the hit is in the field ListedName.standard, docFreq is low:

 {
              "value" : 16.316673,
              "description" : """weight(ListedName.standard:"eagle pointe" in 48) [PerFieldSimilarity], result of:""",
              "details" : [
                {
                  "value" : 16.316673,
                  "description" : "score(doc=48,freq=1.0 = phraseFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 3.0,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 5.4388914,
                      "description" : "idf(), sum of:",
                      "details" : [
                        {
                          "value" : 1.7870536,
                          "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 35.0,
                              "description" : "docFreq",
                              "details" : [ ]
                            },
                            {
                              "value" : 211.0,
                              "description" : "docCount",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 3.651838,
                          "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 5.0,
                              "description" : "docFreq",
                              "details" : [ ]
                            },
                            {
                              "value" : 211.0,
                              "description" : "docCount",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "phraseFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.0,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.0,
                          "description" : "parameter b (norms omitted for field)",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },

whereas when the hit is in the field Line1, docFreq is high:

  {
              "value" : 1.1640041,
              "description" : """weight(Line1:"eagle pointe" in 148) [PerFieldSimilarity], result of:""",
              "details" : [
                {
                  "value" : 1.1640041,
                  "description" : "score(doc=148,freq=1.0 = phraseFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 3.0,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.38800138,
                      "description" : "idf(), sum of:",
                      "details" : [
                        {
                          "value" : 0.18813552,
                          "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 171.0,
                              "description" : "docFreq",
                              "details" : [ ]
                            },
                            {
                              "value" : 206.0,
                              "description" : "docCount",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.19986586,
                          "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 169.0,
                              "description" : "docFreq",
                              "details" : [ ]
                            },
                            {
                              "value" : 206.0,
                              "description" : "docCount",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "phraseFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.0,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.0,
                          "description" : "parameter b (norms omitted for field)",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }

Upvotes: 2

Views: 1268

Answers (1)

EricLavault

Reputation: 16095

It should depend on how the scoring model (cf. Similarity) is defined. Similarity algorithms can be set on a per-index or per-field basis.

Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF.

Now, we can see in the scoring explanation output:

weight(<field>:"eagle pointe" in 48) [PerFieldSimilarity]

In this context, docFreq seems to be restricted to the number of documents that contain the term in that field. However, I didn't find any detailed information about this, and I'm not sure of the logic behind it, because it should depend on the similarity class definition itself and not on whether a custom similarity is set on a specific field.
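As a sanity check, the numbers in the question's explain output can be reproduced from the idf formula printed there. The sketch below is only an illustration: the function names are mine, and the inputs are copied from the two explanations. Note that docCount also differs between the two fields (211 vs. 206), which is consistent with the statistics being collected per field rather than per index. As far as I know, Lucene's docCount is the number of documents that have at least one term in that field.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf as printed in the explain output:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def phrase_score(boost, term_stats, tf_norm=1.0):
    """boost * (sum of per-term idf) * tfNorm, as in the explanation tree.

    term_stats is a list of (docFreq, docCount) pairs, one per phrase term.
    """
    idf_sum = sum(bm25_idf(df, dc) for df, dc in term_stats)
    return boost * idf_sum * tf_norm


# (docFreq, docCount) pairs copied from the two explanations in the question.
listed_name_score = phrase_score(3.0, [(35, 211), (5, 211)])
line1_score = phrase_score(3.0, [(171, 206), (169, 206)])

print(f"ListedName.standard: {listed_name_score:.6f}")  # explain reports 16.316673
print(f"Line1:               {line1_score:.6f}")        # explain reports 1.1640041
```

Both results match the reported scores up to floating-point rounding, so the difference in score comes entirely from the per-field docFreq and docCount values.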

It's possible to set a default similarity for the entire index and to specify one per field in the mapping settings: see Elasticsearch Reference [7.2] » Index modules » Similarity module.
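For illustration, here is a minimal sketch of what such an index-creation body could look like, built as a Python dict and printed as JSON. The field names come from the question. The field types, the multi-field layout of ListedName.standard, and the 7.x-style typeless mapping are assumptions on my part, not something taken from the actual index.

```python
import json

# Hypothetical index body: field names are from the question, but the
# field types and the "fields" sub-mapping are assumed for illustration.
index_body = {
    "settings": {
        "index": {
            # Default similarity used by every field without an override.
            "similarity": {"default": {"type": "BM25"}}
        }
    },
    "mappings": {
        "properties": {
            "ListedName": {
                "type": "text",
                "fields": {
                    # Per-field override set through the "similarity" parameter.
                    "standard": {"type": "text", "similarity": "BM25"}
                },
            },
            "Line1": {"type": "text", "similarity": "BM25"},
        }
    },
}

# The printed JSON is what would be sent as the body of an index-creation request.
print(json.dumps(index_body, indent=2))
```

Comparing a body like this with the existing settings and mappings of the index should show whether either field carries a similarity override.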

You may want to check which similarity is used as the default and whether any field mapping overrides it. For testing, I'd try resetting the default to "classic" (TF-IDF) and removing any existing override for these two fields, to double-check whether docFreq remains consistent across fields (an inconsistency may indicate a bug).

cf. Lucene's TFIDFSimilarity

Upvotes: 1
