Reputation: 3290
I'm not sure I understand how the vector space model is used in Lucene scoring.
I read here (https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) that Lucene scores a document as the sum of the tf-idf of each query term (if we omit the coordination factor, field length and boosts). I don't understand where the vector space model comes in.
The vector space model could be used to calculate the similarity between the tf-idf vector of a document and the tf-idf vector of the query. This would give us a cosine similarity score between the query and the document. The score would be between 0 and 1, so scores from different requests would be easy to compare.
Why doesn't Lucene use this as its score?
Upvotes: 0
Views: 858
Reputation: 421
Lucene uses the 'practical scoring function' described in your link, which is an approximation of the cosine similarity, extended to support 'practical' features such as boosts.
If you take the vector space cosine similarity formula for a query q and a document d, you have:
s(q, d) = q * d / (||q|| * ||d||)
Considering that q and d are vectors like [tf(t1) * idf(t1), ...], and that in the q vector tf(t) is either 1 or 0, the formula becomes:
s(q, d) = ∑( tf(t in d) * idf(t)² )(t in q) / (||q|| * ||d||)
You can further replace ||q|| with 1 / queryNorm(q), given their definition queryNorm = 1 / √sumOfSquaredWeights, where sumOfSquaredWeights is the sum of idf(t)² over the query terms, i.e. ||q||²:
s(q, d) = queryNorm(q) * ∑( tf(t in d) * idf(t)² )(t in q) / ||d||
which is close to the formula they give in the docs:
score(q, d) = queryNorm(q) * coord(q,d) *
∑ ( tf(t in d) * idf(t)² * t.getBoost() * norm(t,d)) (t in q)
However, ||d||, the norm of the document vector, does not have a direct equivalent among the terms of their formula.
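To check the derivation numerically, here is a minimal Python sketch. It is not Lucene code: the vocabulary, the idf values, the term frequencies and the helper names (`tfidf_vector`, `cosine`, `simplified`) are all made up for illustration, and tf is taken as a raw number rather than whatever Lucene computes internally. It only shows that the full cosine similarity and the simplified form `queryNorm(q) * ∑( tf(t in d) * idf(t)² ) / ||d||` give the same value when query term frequencies are 0 or 1.

```python
import math

# Hypothetical toy data, chosen only for illustration.
idf = {"quick": 1.5, "brown": 2.0, "fox": 3.0, "dog": 2.5}
doc_tf = {"quick": 2.0, "fox": 1.0, "dog": 1.0}   # tf(t in d)
query_terms = ["quick", "fox"]                      # tf(t in q) is 1 for these, 0 otherwise


def tfidf_vector(tf):
    """Build [tf(t1) * idf(t1), ...] over the whole vocabulary."""
    return [tf.get(t, 0.0) * idf[t] for t in idf]


def norm(v):
    """Euclidean norm ||v||."""
    return math.sqrt(sum(x * x for x in v))


def cosine(q_vec, d_vec):
    """s(q, d) = q . d / (||q|| * ||d||)"""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    return dot / (norm(q_vec) * norm(d_vec))


def simplified(terms, tf):
    """queryNorm(q) * sum over t in q of tf(t in d) * idf(t)^2, divided by ||d||."""
    sum_of_squared_weights = sum(idf[t] ** 2 for t in terms)
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)
    total = sum(tf.get(t, 0.0) * idf[t] ** 2 for t in terms)
    return query_norm * total / norm(tfidf_vector(tf))


q_vec = tfidf_vector({t: 1.0 for t in query_terms})
d_vec = tfidf_vector(doc_tf)

cos_score = cosine(q_vec, d_vec)
simple_score = simplified(query_terms, doc_tf)
print(cos_score, simple_score)  # the two values agree up to floating-point error
```

Because all weights here are non-negative, the result also stays between 0 and 1, which is the property the question was after. The boosts, coord and norm(t,d) factors of the practical scoring function are deliberately left out of this sketch.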
Upvotes: 3