Reputation: 7244
I'm using Lucene 5.5.0 for indexing. The following criteria describe my environment: my documents consist of 8 fields, all of type String or Long (so no text analysis is required). All of them are stored by Lucene. The strings have a maximum length of 255 characters. The current search method I've implemented, wrapping Lucene's API, looks like this:
public Set<Document> performLuceneSearch(Query query) {
    Set<Document> documents = Sets.newHashSet();
    // the reader instance is reused as often as possible, and exchanged
    // when a write occurs using DirectoryReader.openIfChanged(...).
    if (this.reader.numDocs() > 0) {
        // note that there cannot be a limiting number on the result set.
        // I absolutely need to retrieve ALL matching documents, so I have to
        // make use of 'reader.numDocs()' here.
        TopDocs topDocs = this.searcher.search(query, this.reader.numDocs());
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            int documentId = scoreDoc.doc;
            Document document = this.reader.document(documentId);
            documents.add(document);
        }
    }
    return Collections.unmodifiableSet(documents);
}
Is there any way to do this faster/better, considering my environment outlined above? Especially given that I don't require any ranking or sorting (but rather completeness of the result), I feel that there should be some corners to cut and make things faster.
Upvotes: 1
Views: 1390
Reputation: 5974
There are a couple of things you can do to speed up the search.
First, if you don't use scoring, you should disable norms; this makes the index smaller. Since you only use StringField and LongField (as opposed to, say, a TextField with a keyword tokenizer), norms are already disabled for these field types, so you've got that covered.
Second, you should structure and wrap your queries so that you minimize the calculation of actual scores. That is, if you use a BooleanQuery, use Occur.FILTER instead of Occur.MUST. Both have the same inclusion logic, but FILTER doesn't score. For other queries, consider wrapping them in a ConstantScoreQuery. However, this might not be necessary at all (explanation follows).
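To illustrate, here is a minimal sketch of both variants in Lucene 5.x (the field names `type` and `status` are made up for the example):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

final class NonScoringQueries {

    // Both clauses must match, but neither contributes to the score.
    static BooleanQuery filterOnly() {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("type", "order")), Occur.FILTER)
                .add(new TermQuery(new Term("status", "active")), Occur.FILTER)
                .build();
    }

    // Wraps an arbitrary query so every hit gets the same constant score.
    static Query constantScore(final Query inner) {
        return new ConstantScoreQuery(inner);
    }
}
```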
Third, use a custom Collector. The default search methods are meant for small, ranked or sorted result sets, but your use case doesn't fit that pattern. Here is a sample implementation:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AllDocumentsCollector extends SimpleCollector {

    private final List<Document> documents;
    private LeafReader currentReader;

    public AllDocumentsCollector(final int numDocs) {
        this.documents = new ArrayList<>(numDocs);
    }

    public List<Document> getDocuments() {
        return Collections.unmodifiableList(documents);
    }

    @Override
    protected void doSetNextReader(final LeafReaderContext context) {
        currentReader = context.reader();
    }

    @Override
    public void collect(final int doc) throws IOException {
        documents.add(currentReader.document(doc));
    }

    @Override
    public boolean needsScores() {
        return false;
    }
}
You would use it like this:
public List<Document> performLuceneSearch(final Query query) throws IOException {
    // the reader instance is reused as often as possible, and exchanged
    // when a write occurs using DirectoryReader.openIfChanged(...).
    final AllDocumentsCollector collector = new AllDocumentsCollector(this.reader.numDocs());
    this.searcher.search(query, collector);
    return collector.getDocuments();
}
The collector uses a list instead of a set. Document does not implement equals or hashCode, so you don't profit from a set and only pay for additional equality checks. The final order is the so-called index order: the first document collected is the one that comes first in the index (roughly insertion order, if you don't have custom merge strategies in place, but ultimately it's an arbitrary order that is not guaranteed to be stable or reliable). Also, the collector signals that no scores are needed, which gives you about the same benefit as option 2 above, so you can save yourself some trouble and just leave your queries as they are right now.
Depending on what you need the Documents for, you can get an even greater speedup by using DocValues instead of stored fields. This only pays off if you require just one or two of your fields, not all of them. The rule of thumb is: for few documents but many fields, use stored fields; for many documents but few fields, use DocValues. In any case, you should experiment; 8 fields is not that much, and you might profit even when loading all fields. Here is how you would use DocValues in your indexing process:
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

document.add(new StringField(fieldName, stringContent, Field.Store.NO));
document.add(new SortedDocValuesField(fieldName, new BytesRef(stringContent)));
// OR
document.add(new LongField(fieldName, longValue, Field.Store.NO));
document.add(new NumericDocValuesField(fieldName, longValue));
The field name can be the same, and you can choose not to store your other fields if you can rely completely on DocValues. Then the collector has to be changed; here is an example for one field:
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AllDocumentsCollector extends SimpleCollector {

    private final List<String> documents;
    private final String fieldName;
    private SortedDocValues docValues;

    public AllDocumentsCollector(final String fieldName, final int numDocs) {
        this.fieldName = fieldName;
        this.documents = new ArrayList<>(numDocs);
    }

    public List<String> getDocuments() {
        return Collections.unmodifiableList(documents);
    }

    @Override
    protected void doSetNextReader(final LeafReaderContext context) throws IOException {
        docValues = context.reader().getSortedDocValues(fieldName);
    }

    @Override
    public void collect(final int doc) throws IOException {
        documents.add(docValues.get(doc).utf8ToString());
    }

    @Override
    public boolean needsScores() {
        return false;
    }
}
You would use getNumericDocValues for the long fields, respectively. You have to repeat this (in the same collector, of course) for all the fields you have to load, and most importantly: measure whether it's better to load full documents from the stored fields or to use DocValues.
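For completeness, a sketch of the numeric counterpart (the Lucene 5.x random-access DocValues API; the collector name is made up here):

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AllLongValuesCollector extends SimpleCollector {

    private final List<Long> values;
    private final String fieldName;
    private NumericDocValues docValues;

    public AllLongValuesCollector(final String fieldName, final int numDocs) {
        this.fieldName = fieldName;
        this.values = new ArrayList<>(numDocs);
    }

    public List<Long> getValues() {
        return Collections.unmodifiableList(values);
    }

    @Override
    protected void doSetNextReader(final LeafReaderContext context) throws IOException {
        // getNumericDocValues is the numeric counterpart of getSortedDocValues.
        docValues = context.reader().getNumericDocValues(fieldName);
    }

    @Override
    public void collect(final int doc) throws IOException {
        values.add(docValues.get(doc));
    }

    @Override
    public boolean needsScores() {
        return false;
    }
}
```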
One final note:
I am doing locking on the application level, so Lucene won't have to worry about concurrent reads and writes.
IndexSearcher and IndexWriter are themselves already thread-safe. If you lock solely for Lucene's sake, you can remove those locks and just share the searcher and writer amongst all your threads. Also consider using oal.search.SearcherManager for reusing the IndexReader/IndexSearcher.
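As a rough sketch of the SearcherManager lifecycle (the RAMDirectory and StandardAnalyzer here are just placeholders for your own setup):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.RAMDirectory;

import java.io.IOException;

final class SearcherManagerExample {

    public static void main(final String[] args) throws IOException {
        final IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()));

        // create once and share between all threads
        final SearcherManager manager = new SearcherManager(writer, true, null);

        // after each write, make the changes visible to new searchers:
        manager.maybeRefresh();

        // for every search: acquire, use, release
        final IndexSearcher searcher = manager.acquire();
        try {
            // searcher.search(...) goes here
        } finally {
            manager.release(searcher);
        }
    }
}
```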
Upvotes: 8