Reputation: 6533
I've been working with Hibernate Search for months now, but I still can't make sense of its relevance scoring. Overall I'm satisfied with the results it returns, but even the simplest test does not meet my expectations.
My first test used term frequency (tf). Data:
Results I get:
I'm really confused by this scoring effect. My query is quite complex, but since this test did not involve any other field, it can be simplified as follows: booleanjunction.should(phraseQuery).should(keywordQuery).should(fuzzyQuery)
My analyzer uses the following filters:
StandardFilterFactory
LowerCaseFilterFactory
StopFilterFactory
SnowballPorterFilterFactory for english
My Explanation object: https://jsfiddle.net/o51kh3og/
Upvotes: 4
Views: 4061
Reputation: 1301
Score calculation is quite complex. You have to start from the basic equation:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ (t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
As you said, you have tf, which stands for term frequency; its value is the square root of the frequency of the term. But as you can see in your explanation, you also have norm (aka fieldNorm), which is used in the fieldWeight calculation. Let's take your example:
eklavya eklavya eklavya eklavya eklavya
4.296241 = fieldWeight in 177, product of:
  2.236068 = tf(freq=5.0), with freq of:
    5.0 = termFreq=5.0
  4.391628 = idf(docFreq=6, maxDocs=208)
  0.4375 = fieldNorm(doc=177)
eklavya
4.391628 = fieldWeight in 170, product of:
  1.0 = tf(freq=1.0), with freq of:
    1.0 = termFreq=1.0
  4.391628 = idf(docFreq=6, maxDocs=208)
  1.0 = fieldNorm(doc=170)
Here, the eklavya document has a better score than the other because fieldWeight is the product of tf, idf and fieldNorm. The last of these is higher for the eklavya document because it contains only one term.
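To make the arithmetic concrete, here is a minimal Java sketch that recomputes both fieldWeight values from the numbers in the explanation. The class and method names are made up for illustration, and the idf formula (1 + ln(maxDocs / (docFreq + 1))) is an assumption: it is the classic TF-IDF form, and it happens to reproduce the 4.391628 shown above.

```java
// Illustrative sketch only: names are hypothetical, not part of Lucene or Hibernate Search.
final class FieldWeightSketch {

    // tf is the square root of the raw term frequency.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed classic idf form: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldWeight is the product of tf, idf and fieldNorm.
    static double fieldWeight(double freq, long docFreq, long maxDocs, double fieldNorm) {
        return tf(freq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // Doc 177: "eklavya" five times, fieldNorm 0.4375 (taken from the explanation).
        double repeated = fieldWeight(5.0, 6, 208, 0.4375);
        // Doc 170: "eklavya" once, fieldNorm 1.0 (taken from the explanation).
        double single = fieldWeight(1.0, 6, 208, 1.0);

        System.out.printf("doc 177 fieldWeight: %.6f%n", repeated);
        System.out.printf("doc 170 fieldWeight: %.6f%n", single);
        System.out.println("single-term doc ranks higher: " + (single > repeated));
    }
}
```

The tf of the five-term document is about 2.24 times larger, but its fieldNorm is less than half, so the product ends up slightly below that of the single-term document.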
As the Lucene documentation says:
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
The more terms you have in a field, the lower fieldNorm will be. Be careful with the value of this field.
So, to conclude, this is a good example showing that the score is not calculated from term frequency alone, but also from the number of terms in the field.
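As a rough illustration of how the field length drives this value, the sketch below assumes the classic lengthNorm of 1 / sqrt(number of terms) with no index-time boost. It also assumes the norm is stored with very little precision; the bit mask only imitates that precision loss by keeping two mantissa bits, and is not Lucene's actual encoder. All names are hypothetical.

```java
// Illustrative sketch only: an approximation under the assumptions stated above.
final class LengthNormSketch {

    // Assumed classic length norm: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Imitates low-precision storage by dropping all but the top two mantissa bits.
    static float truncate(float value) {
        int bits = Float.floatToIntBits(value);
        return Float.intBitsToFloat(bits & ~((1 << 21) - 1));
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 5}) {
            float exact = lengthNorm(numTerms);
            System.out.println(numTerms + " term(s): exact=" + exact
                    + ", truncated=" + truncate(exact));
        }
    }
}
```

Under these assumptions a five-term field gives roughly 0.447 before truncation and 0.4375 after it, which matches the fieldNorm reported for doc 177, while a one-term field stays at 1.0.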
Upvotes: 7