user2017748

Reputation: 49

Default boost/scoring with Examine?

I'm trying to figure out how Examine scores its results. I have found information saying that Lucene uses the Vector Space Model (VSM) and the Boolean Model, but surely either Lucene or Examine prioritizes certain index fields over others? For example, does an occurrence of the term in the name/h1 field boost the result more than an occurrence in the text field?

Upvotes: 1

Views: 410

Answers (1)

Nurhak Kaya

Reputation: 1781

Examine is a library that sits on top of Lucene.Net, a high-performance search engine library. Lucene's scoring depends greatly on how documents are indexed.

IndexSearcher.explain(Query, doc) shows how the score for a particular document was computed. As Lucene's scoring documentation puts it, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.

Further details from the same Lucene documentation:

Fields and Documents

In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (tokenized, stored, etc). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field may return different scores for the same query due to length normalization.
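
To make the length-normalization point concrete, here is a minimal sketch in Python. It is not Lucene or Examine code: it assumes a simplified classic TF-IDF style formula (square root of term frequency times 1/sqrt(number of terms in the field)), leaves out IDF and query normalization, and simply sums the per-field scores. All function and field names are made up for illustration.

```python
import math

def length_norm(num_terms):
    # Assumed classic TF-IDF style normalization: shorter fields weigh more.
    return 1.0 / math.sqrt(num_terms)

def field_score(term, field_tokens):
    # Simplified per-field score: sqrt(term frequency) * length norm.
    # IDF and query normalization are deliberately left out.
    tf = field_tokens.count(term)
    if tf == 0:
        return 0.0
    return math.sqrt(tf) * length_norm(len(field_tokens))

def document_score(term, fields):
    # Score each field on its own, then combine (here: sum) per document.
    return sum(field_score(term, tokens) for tokens in fields.values())

content = "examine sits on top of lucene and lucene scores fields".split()

# Same content, stored either in one field or split across two fields.
one_field = {"body": content}
two_fields = {"title": content[:2], "body": content[2:]}

score_one = document_score("examine", one_field)
score_two = document_score("examine", two_fields)

print(score_one, score_two)
```

In this model the two-field document scores higher for "examine" because the term sits in a two-token field rather than a ten-token one, even though no explicit boost was set on either field.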

Score Boosting

Lucene allows influencing search results by "boosting" at different times:

  • Index-time boost by calling Field.setBoost() before a document is added to the index.
  • Query-time boost by setting a boost on a query clause, calling Query.setBoost().

Indexing time boosts are pre-processed for storage efficiency and written to storage for a field as follows:

  • All boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied.
  • The boost is then encoded into a normalization value by the Similarity object at index-time: computeNorm(). The actual encoding depends upon the Similarity implementation, but note that most use a lossy encoding (such as multiplying the boost with document length or similar, packed into a single byte!).
  • Decoding of any index-time normalization values and integration into the document's score is also performed at search time by the Similarity.
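
The three steps above can be sketched roughly as follows. This is an illustrative Python model only: the quantization step is a hypothetical stand-in chosen to show the loss of precision and is not Lucene's actual byte encoding, and the function names merely echo the concepts in the quoted text.

```python
import math

def combined_boost(boosts):
    # All boosts set under the same field name in one document are multiplied.
    result = 1.0
    for boost in boosts:
        result *= boost
    return result

def compute_norm(boosts, num_terms):
    # The boost is folded together with length normalization into one value.
    return combined_boost(boosts) * (1.0 / math.sqrt(num_terms))

STEP = 1.0 / 16  # hypothetical quantization step, NOT Lucene's real encoding

def encode_norm(norm):
    # Lossy: truncate to a multiple of STEP and clamp to a single byte (0..255).
    return max(0, min(255, int(norm / STEP)))

def decode_norm(byte_value):
    # Search-time decoding recovers only the quantized value.
    return byte_value * STEP

norm = compute_norm([2.0, 1.5], num_terms=10)  # roughly 0.9487
stored = encode_norm(norm)                     # 15 with this step size
restored = decode_norm(stored)                 # 0.9375

print(norm, stored, restored)
```

The practical consequence is that small differences between index-time boosts can disappear after encoding, so two slightly different boosts may end up producing the same stored norm.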

Changing Scoring — Similarity

Changing Similarity is an easy way to influence scoring, this is done at index-time with IndexWriterConfig.setSimilarity(Similarity) and at query-time with IndexSearcher.setSimilarity(Similarity). Be sure to use the same Similarity at query-time as at index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.

You can influence scoring by configuring a different built-in Similarity implementation, or by tweaking its parameters, subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. term frequency normalizer).

Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via FieldCache or DocValues and integrate them into the score.
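
As a rough illustration of that idea (again a Python model, not the Lucene or Lucene.Net API), the sketch below separates matching from scoring and lets a swapped-in scorer fold an external per-document value into the result. The class names, the "popularity" value, and the formula are all assumptions made up for this example.

```python
import math

class SimpleSimilarity:
    """Hypothetical stand-in for a similarity: scores an already-matched document."""

    def score(self, tf, num_terms, doc_values):
        return math.sqrt(tf) * (1.0 / math.sqrt(num_terms))

class PopularitySimilarity(SimpleSimilarity):
    """Folds an external per-document value ('popularity') into the score."""

    def score(self, tf, num_terms, doc_values):
        base = super().score(tf, num_terms, doc_values)
        return base * (1.0 + math.log1p(doc_values.get("popularity", 0)))

def search(term, docs, similarity):
    results = []
    for doc in docs:
        tf = doc["tokens"].count(term)
        if tf == 0:
            continue  # matching is a yes/no decision made before scoring
        score = similarity.score(tf, len(doc["tokens"]), doc)
        results.append((doc["id"], score))
    # Highest score first.
    return sorted(results, key=lambda item: item[1], reverse=True)

docs = [
    {"id": "a", "tokens": "lucene scoring".split(), "popularity": 0},
    {"id": "b", "tokens": "lucene scoring explained in detail".split(), "popularity": 50},
    {"id": "c", "tokens": "unrelated text".split(), "popularity": 99},
]

plain = search("lucene", docs, SimpleSimilarity())
popular = search("lucene", docs, PopularitySimilarity())

print([doc_id for doc_id, _ in plain])
print([doc_id for doc_id, _ in popular])
```

With the plain scorer the shorter document "a" ranks first; with the popularity-aware scorer "b" overtakes it. Document "c" never appears in either list, because a high external value cannot make a non-matching document match.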

See the org.apache.lucene.search.similarities package documentation for information on the built-in available scoring models and extending or changing Similarity.

Upvotes: 2
