Marko Topolnik

Reputation: 200148

Solve terrible performance after upgrading from Lucene 4.0 to 4.1

After upgrading from Lucene 4.0 to 4.1 my solution's performance degraded by more than an order of magnitude. The immediate cause is the unconditional compression of stored fields. For now I'm reverting to 4.0, but this is clearly not the way forward; I'm hoping to find a different approach to my solution.

I use Lucene as a database index, meaning my stored fields are quite short: just a few words at most.

I use a CustomScoreQuery where, in CustomScoreProvider#customScore, I end up loading all candidate documents and performing detailed word-similarity scoring against the query. I employ two levels of heuristics to narrow down the candidate document set (based on Dice's coefficient), but in the last step I need to match up each query word against each document word (they could be in a different order) and calculate the total score as the sum of the best word matches.
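To illustrate that last step, here is a rough sketch of the scoring I have in mind (simplified; wordSimilarity stands in for my actual word-level similarity measure):

// Sum-of-best-matches scoring: for each query word, find the best-matching
// document word and add that best score to the total.
static double documentScore(String[] queryWords, String[] docWords) {
  double total = 0;
  for (String queryWord : queryWords) {
    double best = 0;
    for (String docWord : docWords) {
      best = Math.max(best, wordSimilarity(queryWord, docWord));
    }
    total += best;
  }
  return total;
}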

How could I approach this differently and do my calculation in a way that avoids the pitfall of loading compressed fields during query evaluation?

Upvotes: 4

Views: 945

Answers (2)

Marko Topolnik

Reputation: 200148

With Lucene 3.x I had this:

new CustomScoreQuery(bigramQuery, new FieldScoreQuery("bigram-count", Type.BYTE)) {
  @Override
  protected CustomScoreProvider getCustomScoreProvider(IndexReader ir) {
    return new CustomScoreProvider(ir) {
      @Override
      public float customScore(int docnum, float bigramFreq, float docBigramCount) throws IOException {
        // ... calculate Dice's coefficient using bigramFreq and docBigramCount ...
        if (diceCoeff >= threshold) {
          String[] stems = ir.document(docnum).getValues("stems");
          // ... calculate document similarity score using stems ...
        }
        return score;
      }
    };
  }
};

This approach allowed efficient retrieval of cached float values, which I used to get the bigram count of a document; it didn't allow retrieving strings, so I needed to load the whole document to get what I needed to calculate the document similarity score. It worked okayish until the Lucene 4.1 change to compress stored fields.

The proper way to leverage the enhancements in Lucene 4 is to involve DocValues like this:

new CustomScoreQuery(bigramQuery) {
  @Override
  protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext rc) throws IOException {
    final AtomicReader ir = rc.reader();
    final DocValues.Source
       bgCountSrc = ir.docValues("bigram-count").getSource(),
       stemSrc = ir.docValues("stems").getSource();
    return new CustomScoreProvider(rc) {
      @Override
      public float customScore(int docnum, float bgFreq, float... fScores) throws IOException {
        final long bgCount = bgCountSrc.getInt(docnum);
        // ... calculate Dice's coefficient using bgFreq and bgCount ...
        if (diceCoeff >= threshold) {
          final String stems =
             stemSrc.getBytes(docnum, new BytesRef()).utf8ToString();
          // ... calculate document similarity score using stems ...
        }
        return score;
      }
    };
  }
};

This resulted in a performance improvement from 16 ms (Lucene 3.x) to 10 ms (Lucene 4.x).
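For completeness, the doc values must also be written at index time. A minimal sketch of the indexing side, assuming Lucene 4.0's IntDocValuesField and StraightBytesDocValuesField with a space-joined encoding of the stems (bigramCount, joinedStems, and writer are assumed to be in scope):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntDocValuesField;
import org.apache.lucene.document.StraightBytesDocValuesField;
import org.apache.lucene.util.BytesRef;

// Store the per-document values that the score provider reads:
Document doc = new Document();
doc.add(new IntDocValuesField("bigram-count", bigramCount));
doc.add(new StraightBytesDocValuesField("stems", new BytesRef(joinedStems)));
writer.addDocument(doc);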

Upvotes: 0

femtoRgon

Reputation: 33341

In the IndexWriterConfig, you can pass in a Codec, which defines the storage format to be used by the index. This only takes effect when the IndexWriter is constructed (that is, changing the config after construction will have no effect). You'll want to use Lucene40Codec.

Something like:

// You could also simply pass in Version.LUCENE_40 here, and not worry about the Codec
// (though that would likely affect other things as well)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
config.setCodec(new Lucene40Codec());
IndexWriter writer = new IndexWriter(directory, config);

You could also use Lucene40StoredFieldsFormat directly to get the old, uncompressed stored fields format, and return it from a custom Codec implementation. You could probably take most of the code from Lucene41Codec and just override the storedFieldsFormat() method. That might be the more targeted approach, but it's a touch more complex, and I don't know for sure whether you might run into other issues.

A further note on creating a custom codec: the way the API indicates you should accomplish this is to extend FilterCodec. Modifying their example a bit to fit:

public final class CustomCodec extends FilterCodec {

  public CustomCodec() {
    super("CustomCodec", new Lucene41Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return new Lucene40StoredFieldsFormat();
  }

}
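You'd then set it on the config the same way as above. One caveat worth noting: Lucene records the codec name in the segments and resolves it by name via SPI when reading, so a custom codec also needs to be registered in a META-INF/services/org.apache.lucene.codecs.Codec file on the classpath. Usage would look something like:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
config.setCodec(new CustomCodec());
IndexWriter writer = new IndexWriter(directory, config);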


Of course, the other implementation that springs to mind:

I think it's clear to you, as well, that the issue is right around "I end up loading all candidate documents". I won't editorialize too much on a scoring implementation I don't have complete details on or understanding of, but it sounds like you're fighting against Lucene's architecture to make it do what you want. Stored fields generally shouldn't be used for scoring, and you can expect performance to suffer very noticeably with the 4.0 stored fields format as well, though to a somewhat lesser extent. Might there be a better implementation, either in terms of the scoring algorithm or in terms of document structure, that would remove the requirement to score documents based on stored fields?

Upvotes: 2
