timsworth

Reputation: 241

java Lucene best match is not an exact match

Lucene scoring seems to completely elude my understanding.

I have a set of documents for the following:

Senior Education Recruitment Consultant
Senior IT Recruitment Consultant
Senior Recruitment Consultant

These have been analysed using EnglishAnalyzer.

The search query is built with a QueryParser using EnglishAnalyzer as well.

When I search for Senior Recruitment Consultant, every one of the above documents is returned with the same score, whereas the desired (and expected) outcome would be Senior Recruitment Consultant as the top result.

Is there a straightforward way of achieving the desired behaviour that I've missed?
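
For reference, here is roughly how the indexing and searching are wired up (a simplified sketch; I'm assuming Lucene 4.x below, and the field name Title matches the explain output that follows):

Directory directory = new RAMDirectory();
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_47);

// index the three titles
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_47, analyzer));
for (String title : new String[] {
        "Senior Education Recruitment Consultant",
        "Senior IT Recruitment Consultant",
        "Senior Recruitment Consultant" }) {
    Document doc = new Document();
    doc.add(new TextField("Title", title, Field.Store.YES));
    writer.addDocument(doc);
}
writer.close();

// search with the same analyzer
QueryParser parser = new QueryParser(Version.LUCENE_47, "Title", analyzer);
Query query = parser.parse("Senior Recruitment Consultant");
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
TopDocs hits = searcher.search(query, 10);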

Here is my debugging output:

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22157) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  2.3421772 = (MATCH) weight(Title:recruit in 22157) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  1.2005073 = (MATCH) weight(Title:consult in 22157) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22292) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  2.3421772 = (MATCH) weight(Title:recruit in 22292) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  1.2005073 = (MATCH) weight(Title:consult in 22292) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22494) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  2.3421772 = (MATCH) weight(Title:recruit in 22494) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  1.2005073 = (MATCH) weight(Title:consult in 22494) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)


Senior Education Recruitment Consultant 4.6491017
Senior IT Recruitment Consultant 4.6491017
Senior Recruitment Consultant 4.6491017

Upvotes: 3

Views: 877

Answers (2)

femtoRgon

Reputation: 33351

The only scoring element you have to rely on is the lengthnorm.

Lengthnorm is stored with the document at index time, along with the field's boost. It serves to score shorter documents a bit higher.

So why isn't it working? You have two problems:

First: Norms are stored with extremely lossy compression. Each one occupies only a single byte, giving roughly one significant decimal digit of precision. So, basically, the difference between a three-term and a four-term title isn't big enough to impact the score.

On the rationale for this lossiness, from the DefaultSimilarity documentation:

...given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
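
To get a feel for how coarse that one-byte encoding is, here's a small sketch (my assumptions: Lucene 4.x, where DefaultSimilarity exposes public encodeNormValue/decodeNormValue, and the default 1/sqrt(numTerms) length norm with no field boost):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NormPrecision {
    public static void main(String[] args) {
        DefaultSimilarity sim = new DefaultSimilarity();
        for (int numTerms = 3; numTerms <= 6; numTerms++) {
            float raw = (float) (1.0 / Math.sqrt(numTerms));               // lengthnorm = 1/sqrt(#terms), before encoding
            float stored = sim.decodeNormValue(sim.encodeNormValue(raw));  // what survives the single-byte round trip
            System.out.printf("%d terms: raw=%.4f stored=%.4f%n", numTerms, raw, stored);
        }
    }
}

Run something like that and you should see the three-term and four-term cases round-trip to the same stored value, which is exactly the 0.5 fieldNorm showing up in your explain output.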

Second: "IT" is a stop word in English. You mean "Information Technology", but all the analyzer sees is the common English pronoun "it". And no matter how many stop words you throw into the field, they won't impact the lengthnorm.
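
You can watch this happen by running the analyzer by hand; a quick sketch (Lucene 4.x API assumed):

Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_47);
TokenStream ts = analyzer.tokenStream("Title", new StringReader("Senior IT Recruitment Consultant"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());   // senior, recruit, consult -- "IT" never makes it into the index
}
ts.end();
ts.close();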

Here's a test showing some results I came up with:

Senior Education Recruitment Consultant ::: 0.732527
Senior IT Recruitment Consultant ::: 0.732527
Senior Recruitment Consultant ::: 0.732527
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.732527
Senior Education Recruitment Consultant Of Justice ::: 0.64096117
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.3662635

As you see, with "Senior Education Recruitment Consultant Of Justice" we add just one more indexed term, and lengthnorm starts making the difference. But "if and but Senior IT IT IT IT IT Recruitment this and that Consultant" still sees no difference, because all of the added terms are common English stop words.


The solution: You could fix the norm precision issue with a custom similarity implementation that wouldn't be all that difficult to code (copy DefaultSimilarity, and implement a non-lossy encodeNormValue and decodeNormValue). You could also set up the analyzer with a custom, or empty, stop word list (via the EnglishAnalyzer ctor).
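
For the stop word part, something along these lines should do it (a sketch; Lucene 4.x assumed, and MyPreciseSimilarity is just a hypothetical name for the copied-and-fixed similarity):

// EnglishAnalyzer with an empty stop set, so "IT" survives analysis and counts toward the lengthnorm
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_47, CharArraySet.EMPTY_SET);

// a custom similarity (MyPreciseSimilarity, hypothetical) has to be registered at index time and at search time
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
iwc.setSimilarity(new MyPreciseSimilarity());
// ...and likewise searcher.setSimilarity(new MyPreciseSimilarity()) on your IndexSearcher

Bear in mind that either change only takes effect for documents (re)indexed afterwards, since norms are baked in at index time.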

However, that might be throwing the baby out with the bathwater. If it's really important that precise matches be scored higher, you might be better served by expressing that with your query, like this:

\"Senior Recruitment Consultant\" Senior Recruitment Consultant

Results:

Senior Recruitment Consultant ::: 1.465054
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.732527
Senior Education Recruitment Consultant ::: 0.27469763
Senior IT Recruitment Consultant ::: 0.27469763
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.27469763
Senior Education Recruitment Consultant Of Justice ::: 0.24036042
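
In code that's just a query string handed to the same parser; a sketch (the field name and Lucene 4.x version are assumptions on my part):

QueryParser parser = new QueryParser(Version.LUCENE_47, "Title", new EnglishAnalyzer(Version.LUCENE_47));
Query query = parser.parse("\"Senior Recruitment Consultant\" Senior Recruitment Consultant");
// the phrase clause rewards exact, in-order matches; the bare terms still match everything else

You could also boost the phrase clause (query parser syntax "Senior Recruitment Consultant"^2) if you want the exact match pushed even further ahead.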

Upvotes: 3

Zielu

Reputation: 8562

Normal Lucene ranking is frequency based; the distance between words is not taken into account.

BUT, you can add a proximity search term, which requires the words to appear within a predefined distance of each other, to do the trick (however, you kind of need to know how many words are in your query).
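
For example, with the classic query parser a proximity (sloppy phrase) query looks like this (a sketch; the slop value of 2 is arbitrary):

QueryParser parser = new QueryParser(Version.LUCENE_47, "Title", new EnglishAnalyzer(Version.LUCENE_47));
// "~2" is the slop: the terms must occur within two position moves of each other,
// and tighter matches are scored higher than looser ones
Query query = parser.parse("\"Senior Recruitment Consultant\"~2");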

There is an answer to a similar problem on SO: Lucene.Net: Relevancy by distance between words

Upvotes: 0
