Reputation: 21
We are a group of students using Lucene.Net to index several hundred thousand music fingerprints and match them against fingerprints extracted from analyzed songs, to see whether they match anything in our database.
As plenty of new music is released every day, we try to keep our index up to date by adding approximately 5,000-8,000 fingerprints a week. The problem is that when we add several thousand entries directly to our existing index, search quality deteriorates severely, or the new entries cannot be found at all. (We are completely new to Lucene indexing.)
To fix this, we currently have to recreate the index from scratch, which is a very long process (up to 18 hours). My question: are there any alternatives to recreating the entire index? We have considered keeping multiple indexes and searching them through a MultiReader (see the sketch after the code below), but that seems like it would just delay the problem. This is our indexing code:
Lucene.Net.Store.Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(luceneIndexPath));
IndexWriter iw = null;
int fingerCount = 0;
try
{
    // Open the existing index for appending (create: false).
    iw = new IndexWriter(directory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), false, IndexWriter.MaxFieldLength.UNLIMITED);
    iw.UseCompoundFile = false;
    iw.SetSimilarity(new CDR.Indexer.DefaultSimilarityExtended());
    iw.MergeFactor = 10; // default = 10
    iw.SetRAMBufferSizeMB(512 * 3);

    // One document per fingerprint; the field values are filled in per entry.
    Document doc = new Document();
    doc.Add(new Field("FINGERID", "", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("SUBFINGER", "", Field.Store.NO, Field.Index.ANALYZED));
    iw.AddDocument(doc);
    iw.Commit();
}
finally
{
    if (iw != null)
        iw.Dispose();
}
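For completeness, the multiple-index idea we considered would look roughly like this (a minimal sketch; mainIndexPath and weeklyIndexPath are placeholder paths for two separate index directories):

// Open both indexes read-only and search them as a single logical index.
IndexReader mainReader = IndexReader.Open(FSDirectory.Open(new System.IO.DirectoryInfo(mainIndexPath)), true);
IndexReader weeklyReader = IndexReader.Open(FSDirectory.Open(new System.IO.DirectoryInfo(weeklyIndexPath)), true);
MultiReader multiReader = new MultiReader(new IndexReader[] { mainReader, weeklyReader });
IndexSearcher searcher = new IndexSearcher(multiReader);
// ... run fingerprint queries against 'searcher' as usual ...
searcher.Dispose();
multiReader.Dispose(); // also closes the sub-readers (closeSubReaders defaults to true)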
Thank you very much for your consideration!
Upvotes: 2
Views: 953
Reputation: 33791
I'm also new to Lucene.Net, but one thing I have noticed is that indexing is much faster if you don't flush or commit after each document. So if you are adding thousands of new docs to an index, let Lucene manage when to flush its memory buffers, and only call Commit in your code after all the docs have been added.
This does mean that the new docs are not guaranteed to be flushed to disk until you call Commit (which implicitly flushes). But indexing will be much faster, because Lucene doesn't need to create a new index segment on disk for every single doc, segments that would later need to be merged. Instead it accumulates the new documents in memory and only writes a single new segment to disk when the memory buffer fills, covering the totality of those documents that were "premerged", if you will. This approach greatly reduces disk IO for the thousands of added docs, hence the speed increase.
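A minimal sketch of that pattern, reusing the field names from the question (the fingerprints collection and its properties are placeholders):

// Add all new documents first; Lucene buffers them in RAM and only
// writes a new segment when the RAM buffer fills up.
foreach (var fp in fingerprints) // hypothetical collection of new fingerprints
{
    Document doc = new Document();
    doc.Add(new Field("FINGERID", fp.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("SUBFINGER", fp.SubFinger, Field.Store.NO, Field.Index.ANALYZED));
    iw.AddDocument(doc);
}
iw.Commit(); // one commit at the end instead of one per document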
Upvotes: 1