Reputation: 49
I'm trying to figure out how Examine scores its results. I found information about Lucene and that it uses the VSM and the Boolean Model, but surely either Lucene or Examine prioritizes certain index fields over others? For example, does an occurrence of the term in the name/h1 boost the score more than an occurrence in the text field?
Upvotes: 1
Views: 410
Reputation: 1781
Examine is a library that sits on top of Lucene.Net, a high-performance search engine library, and Lucene's scoring depends greatly on how documents are indexed.
Lucene's IndexSearcher.explain(Query, doc) shows how the score for a given document was computed. As the Lucene scoring documentation puts it, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.
Further details from the same Lucene document:
Fields and Documents
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (tokenized, stored, etc). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field may return different scores for the same query due to length normalization.
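To make the length-normalization point concrete, here is a minimal Python sketch. It is a toy, not Lucene's code: it assumes a score loosely modeled on classic TF-IDF (square root of term frequency times 1/sqrt(number of terms in the field)), leaves out idf and boosts, and the function and field names are made up for illustration.

```python
import math

def field_score(term, tokens):
    """Toy per-field score: sqrt(term frequency) times a length norm.

    Assumption: 1/sqrt(field length) as the length norm, in the spirit of
    classic TF-IDF scoring. idf and boosts are deliberately omitted.
    """
    if not tokens:
        return 0.0
    tf = math.sqrt(tokens.count(term))
    length_norm = 1.0 / math.sqrt(len(tokens))
    return tf * length_norm

def document_score(term, fields):
    """Score each field on its own, then combine by summing."""
    return sum(field_score(term, tokens) for tokens in fields.values())

# Same words overall, stored in one field versus split across two fields.
one_field = {"body": "lucene scoring guide for lucene users".split()}
two_fields = {
    "title": "lucene scoring guide".split(),
    "body": "for lucene users".split(),
}

single = document_score("lucene", one_field)
split = document_score("lucene", two_fields)
print(f"one field: {single:.3f}, two fields: {split:.3f}")
```

In this toy the split document scores higher, because each short field gets a larger length norm than the single long field does.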
Score Boosting
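Lucene allows influencing search results by "boosting" at different times: at index time, by setting a boost on a Field before the document is added to the index, and at query time, by setting a boost on a query clause.
Indexing time boosts are pre-processed for storage efficiency and written to storage for a field as follows: the boost is encoded into a normalization value by the Similarity at index time, via computeNorm(). The actual encoding depends upon the Similarity implementation, but note that most use a lossy encoding (such as multiplying the boost with document length or similar, packed into a single byte!).
Changing Scoring: Similarity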
Changing Similarity is an easy way to influence scoring. This is done at index-time with IndexWriterConfig.setSimilarity(Similarity) and at query-time with IndexSearcher.setSimilarity(Similarity). Be sure to use the same Similarity at query-time as at index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.
You can influence scoring by configuring a different built-in Similarity implementation, by tweaking its parameters, or by subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. term frequency normalizer).
Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via FieldCache or DocValues and integrate them into the score.
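The idea of subclassing a scoring model to override one component can be sketched in plain Python. This is not Lucene's Similarity API: ToySimilarity, NoLengthNormSimilarity and rank are hypothetical names, and the formula is the same simplified TF-times-length-norm assumption, with idf left out.

```python
import math

class ToySimilarity:
    """Hypothetical stand-in for a scoring model (not Lucene's Similarity)."""

    def tf(self, freq):
        return math.sqrt(freq)

    def length_norm(self, num_terms):
        # Shorter fields get a larger norm; guard against empty fields.
        return 1.0 / math.sqrt(num_terms) if num_terms else 0.0

    def score(self, term, tokens):
        return self.tf(tokens.count(term)) * self.length_norm(len(tokens))

class NoLengthNormSimilarity(ToySimilarity):
    """Override a single component: ignore field length entirely."""

    def length_norm(self, num_terms):
        return 1.0

def rank(similarity, term, docs):
    """Return document names ordered from highest to lowest score."""
    return sorted(docs, key=lambda name: similarity.score(term, docs[name]),
                  reverse=True)

docs = {
    "short": "lucene tips".split(),
    "long": "lucene in depth a long guide to lucene".split(),
}

print(rank(ToySimilarity(), "lucene", docs))
print(rank(NoLengthNormSimilarity(), "lucene", docs))
```

With length normalization the short document ranks first. Once it is switched off, the long document wins because it contains the term twice. The matching set is identical in both cases; only the ordering changes.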
See the org.apache.lucene.search.similarities package documentation for information on the built-in available scoring models and extending or changing Similarity.
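Coming back to the original question about favoring a name field over body text: one query-time option is Lucene's classic query parser syntax, where `field:term^N` boosts a clause. The helper below only builds such a string. The field names nodeName and bodyText are assumptions about the index, boosted_query is a made-up name, and special characters in the term are not escaped. How a raw query string is handed to Examine is not covered here.

```python
def boosted_query(term, field_boosts):
    """Build a Lucene query string with a per-field boost on each clause.

    field_boosts maps a field name to its boost; a boost of 1 is left
    implicit. Clauses are space-separated, so they combine under the
    parser's default OR operator.
    """
    clauses = []
    for field, boost in field_boosts.items():
        clause = f"{field}:{term}"
        if boost != 1:
            clause += f"^{boost}"
        clauses.append(clause)
    return " ".join(clauses)

# Assumed field names; replace with the fields your index actually has.
query = boosted_query("umbraco", {"nodeName": 3, "bodyText": 1})
print(query)  # nodeName:umbraco^3 bodyText:umbraco
```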
Upvotes: 2