Marko Topolnik

Reputation: 200148

Solve terrible performance after upgrading from Lucene 4.0 to 4.1

After upgrading from Lucene 4.0 to 4.1 my solution's performance degraded by more than an order of magnitude. The immediate cause is the unconditional compression of stored fields. For now I'm reverting to 4.0, but this is clearly not the way forward; I'm hoping to find a different approach to my solution.

I use Lucene as a database index, meaning my stored fields are quite short: just a few words at most.

I use a CustomScoreQuery where, in CustomScoreProvider#customScore, I end up loading all candidate documents and performing detailed word-similarity scoring against the query. I employ two levels of heuristics to narrow down the candidate document set (based on Dice's coefficient), but in the last step I need to match up each query word against each document word (they could be in a different order) and calculate the total score as the sum of the best word matches.
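To illustrate that last step, here is a rough sketch of the scoring I have in mind (simplified; wordSimilarity stands in for my actual word-level similarity measure):

// Sum-of-best-matches scoring: for each query word, find the best-matching
// document word and add that best score to the total.
static double documentScore(String[] queryWords, String[] docWords) {
  double total = 0;
  for (String queryWord : queryWords) {
    double best = 0;
    for (String docWord : docWords) {
      best = Math.max(best, wordSimilarity(queryWord, docWord));
    }
    total += best;
  }
  return total;
}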

How could I approach this differently and do my calculation in a way that avoids the pitfall of loading compressed fields during query evaluation?

Upvotes: 4

Views: 945

Answers (2)

Marko Topolnik

Reputation: 200148

With Lucene 3.x I had this:

new CustomScoreQuery(bigramQuery, new FieldScoreQuery("bigram-count", Type.BYTE)) {
  @Override
  protected CustomScoreProvider getCustomScoreProvider(IndexReader ir) {
    return new CustomScoreProvider(ir) {
      @Override
      public float customScore(int docnum, float bigramFreq, float docBigramCount) throws IOException {
        // ... calculate Dice's coefficient using bigramFreq and docBigramCount ...
        if (diceCoeff >= threshold) {
          String[] stems = ir.document(docnum).getValues("stems");
          // ... calculate document similarity score using stems ...
        }
        return score;
      }
    };
  }
};

This approach allowed efficient retrieval of cached float values, which I used to get the bigram count of a document; it didn't allow retrieving strings, so I needed to load the whole document to get what I needed to calculate the document similarity score. It worked okayish until the Lucene 4.1 change to compress stored fields.

The proper way to leverage the enhancements in Lucene 4 is to involve DocValues like this:

new CustomScoreQuery(bigramQuery) {
  @Override
  protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext rc) throws IOException {
    final AtomicReader ir = rc.reader();
    final DocValues.Source
       bgCountSrc = ir.docValues("bigram-count").getSource(),
       stemSrc = ir.docValues("stems").getSource();
    return new CustomScoreProvider(rc) {
      @Override
      public float customScore(int docnum, float bgFreq, float... fScores) throws IOException {
        final long bgCount = bgCountSrc.getInt(docnum);
        // ... calculate Dice's coefficient using bgFreq and bgCount ...
        if (diceCoeff >= threshold) {
          final String stems =
             stemSrc.getBytes(docnum, new BytesRef()).utf8ToString();
          // ... calculate document similarity score using stems ...
        }
        return score;
      }
    };
  }
};

This resulted in a performance improvement from 16 ms (Lucene 3.x) to 10 ms (Lucene 4.x).
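For completeness, the doc values must also be written at index time. A minimal sketch of the indexing side, assuming Lucene 4.0's IntDocValuesField and StraightBytesDocValuesField with a space-joined encoding of the stems (bigramCount, joinedStems, and writer are assumed to be in scope):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntDocValuesField;
import org.apache.lucene.document.StraightBytesDocValuesField;
import org.apache.lucene.util.BytesRef;

// Store the per-document values that the score provider reads:
Document doc = new Document();
doc.add(new IntDocValuesField("bigram-count", bigramCount));
doc.add(new StraightBytesDocValuesField("stems", new BytesRef(joinedStems)));
writer.addDocument(doc);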

Upvotes: 0

femtoRgon

Reputation: 33341

In the IndexWriterConfig, you can pass in a Codec, which defines the storage format to be used by the index. This only takes effect when the IndexWriter is constructed (that is, changing the config after construction will have no effect). You'll want to use Lucene40Codec.

Something like:

// You could also simply pass in Version.LUCENE_40 here, and not worry about the Codec
// (though that would likely affect other things as well)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
config.setCodec(new Lucene40Codec());
IndexWriter writer = new IndexWriter(directory, config);

You could also use Lucene40StoredFieldsFormat directly to get the old, uncompressed stored fields format, and return it from a custom Codec implementation. You could probably take most of the code from Lucene41Codec and just override the storedFieldsFormat() method. That might be the more targeted approach, but it's a touch more complex, and I don't know for sure whether you might run into other issues.

A further note on creating a custom codec: the way the API indicates you should accomplish this is to extend FilterCodec. Modifying their example a bit to fit:

public final class CustomCodec extends FilterCodec {

  public CustomCodec() {
    super("CustomCodec", new Lucene41Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return new Lucene40StoredFieldsFormat();
  }

}
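You'd then set it on the config the same way as above. One caveat worth noting: Lucene records the codec name in the segments and resolves it by name via SPI when reading, so a custom codec also needs to be registered in a META-INF/services/org.apache.lucene.codecs.Codec file on the classpath. Usage would look something like:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);
config.setCodec(new CustomCodec());
IndexWriter writer = new IndexWriter(directory, config);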


Of course, the other implementation that springs to mind:

I think it's clear to you, as well, that the issue is right around "I end up loading all candidate documents". I won't editorialize too much on a scoring implementation I don't have complete details on or understanding of, but it sounds like you're fighting against Lucene's architecture to make it do what you want. Stored fields generally shouldn't be used for scoring, and you can expect performance to suffer very noticeably with the 4.0 stored fields format as well, though to a somewhat lesser extent. Might there be a better implementation, either in terms of the scoring algorithm or in terms of document structure, that would remove the requirement to score documents based on stored fields?

Upvotes: 2
