Reputation: 21
I'm currently figuring out the tire gem (I'm also new to elasticsearch and Lucene) and trying some things out. I will need to do some (probably non-trivial) scoring, so I'm trying to get a grip on that. I read everything I could find on the web about the scoring formula and am trying to match what I found against an explained query.
If I read the figures correctly, the two documents with the title "foo foo foo foo" get different scores, which is certainly not intended. I guess I am missing a step during or after indexing, but I could not figure out which.
Below is my code. I'm not using the tire DSL exactly the way it is intended because I want to figure things out; things may look more tire-ish later.
require 'tire'
require 'pp'
class Model
  INDEX = 'myindex'
  TYPE = 'company'
  class << self
    def delete_index
      Tire.index(INDEX) { delete }
    end
    def create_mapping
      Tire.index INDEX do
        create mappings: {
          TYPE => {
            properties: {
              title: { type: 'string' }
            }
          }
        }
      end
    end
    def refresh_index
      Tire.index INDEX do
        refresh
      end
    end
  end
  def initialize(attributes = {})
    @attributes = attributes.merge(:_id => object_id) #use oid as id, just for testing
  end
  def _type
    TYPE
  end
  def id
    object_id.to_s #convert to string because tire compares to object_id!
  end
  def index
    item = self
    Tire.index INDEX do
      store item
    end
  end
  def to_indexed_json
    @attributes.to_json
  end
  ENTITIES = [
    new(title: "foo foo foo foo"),
    new(title: "foo"),
    new(title: "bar"),
    new(title: "foo bar"),
    new(title: "xxx"),
    new(title: "foo foo foo foo"),
    new(title: "foo foo"),
    new(title: "foo bar baz")
  ]
  QUERIES = {
    :foo => { query_string: { query: "foo" } },
    :all => { match_all: {} }
  }
  def self.custom_explained_search(q)
    Tire.search(Model::INDEX, :wrapper => Model, :explain => true) do |search|
      search.query do |query|
        query.send :instance_variable_set, :@value, q
      end
    end
  end
end
class Tire::Results::Collection
  def explained
    @response["hits"]["hits"].map do |hit|
      {
        "_id" => hit["_id"],
        "_explanation" => hit["_explanation"],
        "title" => hit["_source"]["title"]
      }
    end
  end
end
Model.delete_index
Model.create_mapping
Model::ENTITIES.each(&:index)
Model.refresh_index
s = Model.custom_explained_search(Model::QUERIES[:foo])
pp s.results.explained
The printed result is this:
[{"_id"=>"2169251840",
"_explanation"=>
{"value"=>0.54932046,
"description"=>"fieldWeight(_all:foo in 0), product of:",
"details"=>
[{"value"=>1.4142135,
"description"=>"btq, product of:",
"details"=>
[{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"},
{"value"=>1.0, "description"=>"allPayload(...)"}]},
{"value"=>0.7768564, "description"=>"idf(_all: foo=4)"},
{"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]},
"title"=>"foo foo foo foo"},
{"_id"=>"2169251720",
"_explanation"=>
{"value"=>0.54932046,
"description"=>"fieldWeight(_all:foo in 1), product of:",
"details"=>
[{"value"=>0.70710677,
"description"=>"btq, product of:",
"details"=>
[{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
{"value"=>1.0, "description"=>"allPayload(...)"}]},
{"value"=>0.7768564, "description"=>"idf(_all: foo=4)"},
{"value"=>1.0, "description"=>"fieldNorm(field=_all, doc=1)"}]},
"title"=>"foo"},
{"_id"=>"2169250520",
"_explanation"=>
{"value"=>0.48553526,
"description"=>"fieldWeight(_all:foo in 2), product of:",
"details"=>
[{"value"=>1.0,
"description"=>"btq, product of:",
"details"=>
[{"value"=>1.0, "description"=>"tf(phraseFreq=1.0)"},
{"value"=>1.0, "description"=>"allPayload(...)"}]},
{"value"=>0.7768564, "description"=>"idf(_all: foo=4)"},
{"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=2)"}]},
"title"=>"foo foo"},
{"_id"=>"2169251320",
"_explanation"=>
{"value"=>0.44194174,
"description"=>"fieldWeight(_all:foo in 1), product of:",
"details"=>
[{"value"=>0.70710677,
"description"=>"btq, product of:",
"details"=>
[{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
{"value"=>1.0, "description"=>"allPayload(...)"}]},
{"value"=>1.0, "description"=>"idf(_all: foo=1)"},
{"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=1)"}]},
"title"=>"foo bar"},
{"_id"=>"2169250380",
"_explanation"=>
{"value"=>0.27466023,
"description"=>"fieldWeight(_all:foo in 3), product of:",
"details"=>
[{"value"=>0.70710677,
"description"=>"btq, product of:",
"details"=>
[{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
{"value"=>1.0, "description"=>"allPayload(...)"}]},
{"value"=>0.7768564, "description"=>"idf(_all: foo=4)"},
{"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=3)"}]},
"title"=>"foo bar baz"},
{"_id"=>"2169250660",
"_explanation"=>
{"value"=>0.2169777,
"description"=>"fieldWeight(_all:foo in 0), product of:",
"details"=>
[{"value"=>1.4142135,
"description"=>"btq, product of:",
"details"=>
[{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"},
{"value"=>1.0, "description"=>"allPayload(...)"}]},
{"value"=>0.30685282, "description"=>"idf(_all: foo=1)"},
{"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]},
"title"=>"foo foo foo foo"}]
Am I reading the figures wrong? Or misusing Tire? Maybe just missing some "reindex whole collection" step?
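One way to check the reading of the figures is to multiply the three factors of each hit back together. The sketch below does that for the two "foo foo foo foo" hits, using only numbers copied from the printed result; the helper name `field_weight` is made up for illustration.

```ruby
# Factors copied from the explain output for the two "foo foo foo foo" hits.
HITS = {
  "2169251840" => { tf: 1.4142135, idf: 0.7768564,  field_norm: 0.5 },
  "2169250660" => { tf: 1.4142135, idf: 0.30685282, field_norm: 0.5 }
}

# Hypothetical helper: the explain output describes fieldWeight as the
# product of tf, idf and fieldNorm.
def field_weight(factors)
  factors[:tf] * factors[:idf] * factors[:field_norm]
end

SCORES = HITS.transform_values { |factors| field_weight(factors) }
SCORES.each { |id, score| puts format("%s %.7f", id, score) }
```

Both hits share the same tf and fieldNorm. Only the idf factor differs: the output reports idf(_all: foo=4) for one hit and idf(_all: foo=1) for the other, so the score gap comes entirely from that factor.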
Upvotes: 2
Views: 693
Reputation: 18925
AFAIK, if no explicit sort field is defined, sorting defaults to (a variant of) tf*idf (http://en.wikipedia.org/wiki/Tf*idf).
Literally: term frequency * inverse document frequency.
From wikipedia:
Term frequency (term count): The term count in the given document is simply the number of times a given term appears in that document.
Inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
In this case, the "term frequency" component of the sorting most likely causes "foo foo foo foo" to score higher than the other docs when searching for 'foo'.
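To make the two components concrete, here is a minimal sketch. It assumes the index uses Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(numDocs / (docFreq + 1)); that is an assumption, not something stated above. The `num_docs` values are not in the question's output either; they are guesses that happen to reproduce its idf figures.

```ruby
# Assumed formulas (Lucene's classic TF-IDF similarity).
def tf(freq)
  Math.sqrt(freq)
end

def idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

puts tf(2.0)    # compare with tf(phraseFreq=2.0) in the question's output
puts tf(0.5)    # compare with tf(phraseFreq=0.5)
puts idf(4, 4)  # num_docs = 4 is a guess; compare with idf(_all: foo=4)
puts idf(1, 2)  # num_docs = 2 is a guess; compare with the idf of 1.0
puts idf(1, 1)  # num_docs = 1 is a guess; compare with idf(_all: foo=1)
```

Under these assumptions, the same term gets a different idf whenever the document counts it is computed from differ, even if tf is identical.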
Moreover, about the effect you see when changing ids: I'm not sure, but I'm guessing it has to do with ES storing docs ordered by id internally.
If that's the case, two documents with the same sort score would be ordered by id as a tiebreaker. You can of course define multiple sorts to change this behavior (e.g. sort=sorta+desc, sortb+desc; in that case sortb is used as the tiebreaker for all docs that score the same on sorta).
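The tiebreaker idea can be sketched in plain Ruby, independent of elasticsearch or Tire. The hits and their field names below are made up for illustration; the point is only that a second sort key decides the order among equal scores.

```ruby
# Hypothetical hits: two of them share the same score.
hits = [
  { id: "b", score: 0.5 },
  { id: "a", score: 0.5 },
  { id: "c", score: 0.9 }
]

# Sort by score descending, then by id ascending as the tiebreaker.
SORTED = hits.sort_by { |hit| [-hit[:score], hit[:id]] }
puts SORTED.map { |hit| hit[:id] }.join(", ")  # prints "c, a, b"
```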
Upvotes: 2