Reputation: 77
How can I merge two or more Lucene indexes and avoid duplicate values in the final index?
Today, I'm using this code to merge the indexes:
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
LogMergePolicy logMerge = new LogMergePolicy() {
    @Override
    protected long size(SegmentInfo arg0) throws IOException {
        return 0;
    }
};
logMerge.setMergeFactor(1000);
// Register the merge policy; without this call it is never used.
iwc.setMergePolicy(logMerge);
iwc.setRAMBufferSizeMB(50);
// Create the writer only after the config is complete: IndexWriter clones the config it is given.
IndexWriter writer = new IndexWriter(getFSDirectory(INDEX_DIR), iwc);
Directory indexes[] = new Directory[INDEXES_DIR.size()];
for (int i = 0; i < INDEXES_DIR.size(); i++) {
    Directory d = FSDirectory.open(new File(INDEXES_DIR.get(i)).getAbsoluteFile());
    System.out.println("Adding: " + INDEXES_DIR.get(i));
    indexes[i] = d;
}
System.out.print("Merging added indexes...");
writer.addIndexes(indexes);
// Commit the merged index and release the write lock.
writer.close();
System.out.println("done");
Upvotes: 2
Views: 1532
Reputation: 33341
I don't believe Lucene provides any nice, easy way to do that, comparable to addIndexes.
You will likely have to either:
Make another pass through the index to remove duplicates. You could use TermsEnum to walk the terms of your id field and read term() and docFreq() for each one; a docFreq() greater than 1 means that id occurs in more than one document. You could then get the doc IDs of the duplicates from a DocsEnum, obtained by calling TermsEnum.docs.
Or, probably the saner way, perform the merge yourself, using IndexWriter.updateDocument to prevent duplicates.
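To make the second option concrete, here is a minimal sketch of the delete-then-add behaviour that IndexWriter.updateDocument gives you when it is keyed on a unique id term. It deliberately uses no Lucene classes: the TargetIndex class, the "id" and "body" field names, and the in-memory maps standing in for source indexes are all hypothetical, chosen only to show the merge loop and its effect on duplicates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupMergeSketch {

    /** Hypothetical stand-in for the target index: at most one document per id. */
    static final class TargetIndex {
        private final Map<String, Map<String, String>> docsById = new LinkedHashMap<>();

        // Models writer.updateDocument(new Term("id", id), doc):
        // delete whatever already carries this id, then add the new document.
        void updateDocument(String id, Map<String, String> doc) {
            docsById.remove(id);
            docsById.put(id, doc);
        }

        int numDocs() {
            return docsById.size();
        }

        Map<String, String> get(String id) {
            return docsById.get(id);
        }
    }

    /** Copies every document of every source into the target, keyed on "id". */
    static TargetIndex merge(List<List<Map<String, String>>> sources) {
        TargetIndex target = new TargetIndex();
        for (List<Map<String, String>> source : sources) {
            for (Map<String, String> doc : source) {
                // A later document with the same id replaces the earlier one.
                target.updateDocument(doc.get("id"), doc);
            }
        }
        return target;
    }

    /** Builds a toy document with an id and a body field. */
    static Map<String, String> doc(String id, String body) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("id", id);
        d.put("body", body);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> indexA = new ArrayList<>();
        indexA.add(doc("1", "first"));
        indexA.add(doc("2", "second"));

        List<Map<String, String>> indexB = new ArrayList<>();
        indexB.add(doc("2", "second, newer copy"));
        indexB.add(doc("3", "third"));

        List<List<Map<String, String>>> sources = new ArrayList<>();
        sources.add(indexA);
        sources.add(indexB);

        TargetIndex merged = merge(sources);
        System.out.println(merged.numDocs());              // prints 3
        System.out.println(merged.get("2").get("body"));   // prints "second, newer copy"
    }
}
```

Against real Lucene 3.6 indexes, the same loop would roughly open each source with IndexReader.open(directory), iterate from 0 to maxDoc(), skip entries where isDeleted(i) is true, load each one with reader.document(i), and pass it to writer.updateDocument(new Term("id", doc.get("id")), doc). Keep in mind that reader.document(i) only returns stored fields, so this approach is only lossless if every field you need was stored, and that the id field must be indexed as a single untokenized term for the delete to match.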
Upvotes: 2