user2628641

Reputation: 2154

Lucene Scoring mechanism

I have 3 product names:

  1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls
  2. Bounty Select-A-Size Paper Towels (12 rolls)
  3. Bounty Select-A-Size Paper Towels White 12 Mega Rolls

As you can see, the 1st and 3rd names are the same except for the position of the word "White". The 2nd name lacks the words "White" and "Mega".

Now, when I run the following code:

public static void main(String[] args) throws IOException, ParseException {
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // 1. create the index
    Directory index = new RAMDirectory();

    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "Bounty Select-A-Size White Paper Towels 12 Mega Rolls");
    addDoc(w, "Bounty Select-A-Size Paper Towels (12 rolls)");
    addDoc(w, "Bounty Select-A-Size Paper Towels White 12 Mega Rolls");
    w.close();

    // 2. query
    String querystr = "Bounty Select-A-Size White Paper Towels 12 Mega Rolls";

    Query q = new QueryParser("title", analyzer).parse(querystr);

    // 3. search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    ScoreDoc[] hits = searcher.search(q, 4).scoreDocs;

    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for(int i=0;i<hits.length;++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title") + "\t score " + hits[i].score);
    }

    reader.close();
}

private static void addDoc(IndexWriter w, String title) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    w.addDocument(doc);
}

The result is:

 1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls    score 0.7363191
 2. Bounty Select-A-Size Paper Towels White 12 Mega Rolls    score 0.7363191
 3. Bounty Select-A-Size Paper Towels (12 rolls)     score 0.42395753

So far, so good: the first 2 names contain the same words, so they score the same.

However, when I extend the number of names in the index (same code, but instead of the 3 hard-coded names, I load about 5000 of them from a file), the scoring changes:

 1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls             4.1677103
 2. Bounty Select-A-Size Paper Towels (12 rolls)                     4.1677103
 3. Bounty Select-A-Size Paper Towels White 12 Mega Rolls            2.874553

My question is:

Is it possible for the scores to change this way when the data set changes?

If so, how?

If not, then I know there is a bug in my code...

Upvotes: 4

Views: 299

Answers (1)

femtoRgon

Reputation: 33341

That's entirely normal, and not at all indicative of a bug in your code.

Scores can change when the contents of your index change, even if those changes don't seem to have much to do with your particular query. A score is only meaningful within the context of a single search execution, so its absolute value matters less than how it compares with the scores of the other results of the same query. In both result sets, the first two results have equal scores, and the third is significantly lower.

The main reason for the change here will be the idf (inverse document frequency) scoring factor. It gives more weight to terms that occur in fewer documents across the entire index, the thinking being that a common term like "the" is less interesting as a search term than a less common one like "geronimo".

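In your case, the gap between your best result and the third result has narrowed a bit now that the rest of the corpus is in the index: the third score is about 58% of the top score in the first run (0.42395753 / 0.7363191) and about 69% in the second (2.874553 / 4.1677103). So it would seem that "white" and "mega" are more common (and thus less interesting) terms than some of the other ones in the query.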
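
To make the idf effect concrete, here is a minimal sketch in plain Java (no Lucene dependency) that computes idf for a term from its document frequency and the index size. The two formulas are the commonly documented classic TF-IDF and BM25 forms; which one applies, and its exact shape, depends on your Lucene version and the Similarity you have configured, so treat them as assumptions. The class name IdfSketch, the helper names, and the document frequencies for the 5000-title index are all invented for illustration; only the counts for the 3-title index come from the question.

```java
class IdfSketch {

    // Classic TF-IDF style idf (assumed form): 1 + ln(numDocs / (docFreq + 1)).
    public static double classicIdf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // BM25 style idf (assumed form): ln(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)).
    public static double bm25Idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Prints both idf variants for one term so the two index sizes can be compared.
    private static void report(String term, long docFreq, long numDocs) {
        System.out.printf("%-8s df=%4d N=%4d classic=%.4f bm25=%.4f%n",
                term, docFreq, numDocs,
                classicIdf(docFreq, numDocs), bm25Idf(docFreq, numDocs));
    }

    public static void main(String[] args) {
        // Counts taken from the three titles in the question:
        // "bounty" is in all 3 titles, "white" is in 2 of them.
        report("bounty", 3, 3);
        report("white", 2, 3);

        // HYPOTHETICAL counts for a 5000-title index, invented for illustration:
        // here "white" is assumed to be far more common than "bounty".
        report("bounty", 12, 5000);
        report("white", 400, 5000);
    }
}
```

With only the three titles indexed, "white" is the rarer term and so carries more weight than "bounty". Under the invented 5000-title counts the order flips, which is the kind of shift that changes how far apart two results score.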


An additional note: You can use Lucene's IndexSearcher.explain method to get detailed information on why documents score the way they do:

System.out.println(searcher.explain(query, docNumber).toString());

Upvotes: 1
