schnittchen

Reputation: 21

Why do two identical documents score differently?

I'm currently learning the tire gem (I'm also new to Elasticsearch and Lucene) and experimenting with it. I will need to do some (probably non-trivial) scoring, so I'm trying to get a grip on that. I have read everything I could find on the web about the scoring formula and am trying to match it against the output of an explained query.

If I read the figures correctly, the two documents with the title "foo foo foo foo" get different scores, which is certainly not intended. I guess I am missing a step during or after indexing, but I could not figure out which.

Below is my code. I'm not using the tire DSL quite the way it is intended because I want to figure things out; things may look more tire-ish later.

require 'tire'
require 'pp'

class Model
  INDEX = 'myindex'
  TYPE = 'company'

  class << self
    def delete_index
      Tire.index(INDEX) { delete }
    end

    def create_mapping
      Tire.index INDEX do
        create mappings: {
          TYPE => {
            properties: {
              title: { type: 'string' }
            }
          }
        }
      end
    end

    def refresh_index
      Tire.index INDEX do
        refresh
      end
    end
  end

  def initialize(attributes = {})
    @attributes = attributes.merge(:_id => object_id) #use oid as id, just for testing
  end

  def _type
    TYPE
  end

  def id
    object_id.to_s #convert to string because tire compares to object_id!
  end

  def index
    item = self
    Tire.index INDEX do
      store item
    end
  end

  def to_indexed_json
    @attributes.to_json
  end

  ENTITIES = [
    new(title: "foo foo foo foo"),
    new(title: "foo"),
    new(title: "bar"),
    new(title: "foo bar"),
    new(title: "xxx"),
    new(title: "foo foo foo foo"),
    new(title: "foo foo"),
    new(title: "foo bar baz")
  ]

  QUERIES = {
    :foo => { query_string: { query: "foo" } },
    :all => { match_all: {} }
  }

  def self.custom_explained_search(q)
    Tire.search(Model::INDEX, :wrapper => Model, :explain => true) do |search|
      search.query do |query|
        query.send :instance_variable_set, :@value, q
      end
    end
  end
end

class Tire::Results::Collection
  def explained
    @response["hits"]["hits"].map do |hit|
      {
        "_id" => hit["_id"],
        "_explanation" => hit["_explanation"],
        "title" => hit["_source"]["title"]
      }
    end
  end
end

Model.delete_index
Model.create_mapping
Model::ENTITIES.each &:index
Model.refresh_index
s = Model.custom_explained_search(Model::QUERIES[:foo])
pp s.results.explained

The printed result is this:

[{"_id"=>"2169251840",
  "_explanation"=>
   {"value"=>0.54932046,
    "description"=>"fieldWeight(_all:foo in 0), product of:",
    "details"=>
     [{"value"=>1.4142135,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]},
  "title"=>"foo foo foo foo"},
 {"_id"=>"2169251720",
  "_explanation"=>
   {"value"=>0.54932046,
    "description"=>"fieldWeight(_all:foo in 1), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>1.0, "description"=>"fieldNorm(field=_all, doc=1)"}]},
  "title"=>"foo"},
 {"_id"=>"2169250520",
  "_explanation"=>
   {"value"=>0.48553526,
    "description"=>"fieldWeight(_all:foo in 2), product of:",
    "details"=>
     [{"value"=>1.0,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.0, "description"=>"tf(phraseFreq=1.0)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=2)"}]},
  "title"=>"foo foo"},
 {"_id"=>"2169251320",
  "_explanation"=>
   {"value"=>0.44194174,
    "description"=>"fieldWeight(_all:foo in 1), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>1.0, "description"=>"idf(_all:  foo=1)"},
      {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=1)"}]},
  "title"=>"foo bar"},
 {"_id"=>"2169250380",
  "_explanation"=>
   {"value"=>0.27466023,
    "description"=>"fieldWeight(_all:foo in 3), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=3)"}]},
  "title"=>"foo bar baz"},
 {"_id"=>"2169250660",
  "_explanation"=>
   {"value"=>0.2169777,
    "description"=>"fieldWeight(_all:foo in 0), product of:",
    "details"=>
     [{"value"=>1.4142135,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.30685282, "description"=>"idf(_all:  foo=1)"},
      {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]},
  "title"=>"foo foo foo foo"}]

Am I reading the figures wrong? Or misusing Tire? Maybe just missing some "reindex whole collection" step?

Upvotes: 2

Views: 693

Answers (1)

Geert-Jan

Reputation: 18925

AFAIK, if no explicit sort field is defined, results are sorted by relevance score, which defaults to (a variant of) tf * idf (http://en.wikipedia.org/wiki/Tf*idf).

Literally: term frequency * inverse document frequency.

From wikipedia:

Term frequency (term count): The term count in the given document is simply the number of times a given term appears in that document.

Inverse document frequency: a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
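
The two definitions quoted above can be sketched in a few lines of Ruby. This follows the plain Wikipedia wording, not the exact formula ES/Lucene uses; the helper names and the whitespace tokenisation are assumptions made for illustration, and the titles are the ones from the question.

```ruby
# Titles taken from the question; splitting on whitespace is an assumption.
TITLES = [
  "foo foo foo foo", "foo", "bar", "foo bar",
  "xxx", "foo foo foo foo", "foo foo", "foo bar baz"
].freeze

# Term frequency: raw count of the term in one document.
def term_count(term, title)
  title.split.count(term)
end

# Inverse document frequency: log(total docs / docs containing the term).
def inverse_document_frequency(term, titles)
  containing = titles.count { |t| t.split.include?(term) }
  Math.log(titles.size.to_f / containing)
end

# Plain tf * idf, as in the quoted definitions.
def tf_idf(term, title, titles)
  term_count(term, title) * inverse_document_frequency(term, titles)
end

TITLES.each do |title|
  puts format("%-16s %.4f", title, tf_idf("foo", title, TITLES))
end
```

When every document is scored against the same set of statistics, as here, the two "foo foo foo foo" titles get the same value.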

In this case, the "term frequency" component most likely causes "foo foo foo foo" to score higher than the other docs when searching for 'foo'.
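
As a sanity check on the explain output in the question, the three factors of each fieldWeight can be multiplied out. The sketch below copies the figures for the two "foo foo foo foo" hits verbatim. The `classic_idf` helper assumes Lucene's classic formula, 1 + ln(numDocs / (docFreq + 1)), and the numDocs values passed to it are inferred from the printed idf values, not reported by ES.

```ruby
# Factors copied from the explain output in the question.
HITS = {
  "2169251840" => { tf: 1.4142135, idf: 0.7768564,  norm: 0.5, value: 0.54932046 },
  "2169250660" => { tf: 1.4142135, idf: 0.30685282, norm: 0.5, value: 0.2169777 }
}.freeze

# The explain output describes fieldWeight as the product of tf, idf and fieldNorm.
def field_weight(hit)
  hit[:tf] * hit[:idf] * hit[:norm]
end

# Assumed: Lucene's classic idf, 1 + ln(numDocs / (docFreq + 1)).
def classic_idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

HITS.each do |id, hit|
  puts format("%s computed=%.7f reported=%.7f", id, field_weight(hit), hit[:value])
end

# The num_docs arguments below are inferred, not taken from the output.
puts format("idf(docFreq=4, numDocs=4) = %.7f", classic_idf(4, 4))
puts format("idf(docFreq=1, numDocs=1) = %.7f", classic_idf(1, 1))
```

Under those assumptions, tf and fieldNorm are the same for both hits and only idf differs. That would be consistent with the two copies being scored against different term statistics (for example on different shards), but that reading is a guess from the numbers, not something the output states.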

Moreover, about the effect you see when changing ids: I'm guessing it has to do with ES storing docs ordered by id internally, but I'm not sure about that.

If that's the case, two documents with the same sort score would be ordered by id as a tiebreaker. You can of course define multiple sorts to change this behavior (e.g. sort=sorta+desc,sortb+desc; in that case sortb is used as the tiebreaker for all docs that score the same on sorta).
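
The tiebreaker idea can be illustrated outside of ES with plain Ruby. The `Hit` struct and the sample ids and scores are made up for the example; this only shows how a secondary sort key orders documents whose primary score is equal, not how ES orders ties internally.

```ruby
# Hypothetical hit record: an id plus a relevance score.
Hit = Struct.new(:id, :score)

hits = [
  Hit.new("c", 0.5),
  Hit.new("a", 0.5),
  Hit.new("b", 0.9)
]

# Primary key: score descending. Secondary key (tiebreaker): id ascending.
sorted = hits.sort_by { |hit| [-hit.score, hit.id] }

sorted.each { |hit| puts "#{hit.id} #{hit.score}" }
```

Hits "a" and "c" share a score, so their relative order is decided entirely by the second key.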

Upvotes: 2
