Reputation: 10132
I have written a program that indexes database data to disk, and I am not sure whether my indexing speed is reasonable, i.e. whether it is very slow and whether it can be improved further.
I get around 15,000 documents per hour, which produces an index directory of around 2,600 KB when creating new indices.
I am using Lucene 6.0.0 on Windows 8.1 64-bit, on a machine with 16 GB RAM and an 8-core Intel Core i7. I am indexing on the local machine and am not sure what kind of disk I have; it's the usual one that comes with a Windows PC.
I am using Spring Batch to INNER JOIN two database tables and get a row-mapped object from the ItemReader, then I prepare a Document from this object.
I always use writer.updateDocument(contentDuplicateKeyTerm, doc); and not addDocument(doc), since in Lucene 6.0.0 updateDocument adds the document to the index if it doesn't already exist, in addition to updating an existing document.
I am not aware of any benchmark to compare my program against.
Please suggest.
EDIT: I am now able to achieve around 180,000 documents per hour, roughly 12 times the original rate. The issue was calling IndexWriter.commit() after updating each Document; now I commit at regular intervals, and that has improved performance greatly.
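The commit-interval change can be sketched without Lucene on the classpath. This is a minimal illustration, not the poster's actual code: `DocumentSink`, `BatchCommitter`, `CountingSink`, and `BatchCommitDemo` are hypothetical names, and `DocumentSink` only stands in for the two `IndexWriter` calls mentioned above (`updateDocument` and `commit`). It counts how many commits a run performs for a given interval.

```java
/** Hypothetical stand-in for the two IndexWriter calls mentioned in the post. */
interface DocumentSink {
    void updateDocument(String key, String doc); // add-or-replace by key
    void commit();                               // the expensive call
}

/** Forwards every document to the sink, but commits only every N documents. */
class BatchCommitter {
    private final DocumentSink sink;
    private final int commitInterval;
    private int pending = 0;

    BatchCommitter(DocumentSink sink, int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be >= 1");
        }
        this.sink = sink;
        this.commitInterval = commitInterval;
    }

    void write(String key, String doc) {
        sink.updateDocument(key, doc);
        if (++pending >= commitInterval) {
            sink.commit();
            pending = 0;
        }
    }

    /** Commits whatever is left after the last full chunk. */
    void finish() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }
}

/** Test double that only counts calls. */
class CountingSink implements DocumentSink {
    int updates = 0;
    int commits = 0;
    public void updateDocument(String key, String doc) { updates++; }
    public void commit() { commits++; }
}

class BatchCommitDemo {
    /** Returns the number of commits needed to write `docs` documents. */
    static int commitsFor(int docs, int interval) {
        CountingSink sink = new CountingSink();
        BatchCommitter committer = new BatchCommitter(sink, interval);
        for (int i = 0; i < docs; i++) {
            committer.write("id-" + i, "body " + i);
        }
        committer.finish();
        return sink.commits;
    }

    public static void main(String[] args) {
        System.out.println(commitsFor(1000, 1));   // prints 1000 (commit per document)
        System.out.println(commitsFor(1000, 100)); // prints 10 (commit per chunk of 100)
    }
}
```

With Spring Batch, the chunk size already plays the role of `commitInterval`, so in practice the commit would sit at the end of the chunk write rather than in a separate counter.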
Upvotes: 1
Views: 3548
Reputation: 10132
I was making multiple mistakes, and that is why write performance was slow. The mistakes and their fixes were:
I was committing after each document, so I changed the program to commit after each chunk, as I am using Spring Batch. Increasing the commit interval improved performance significantly.
I was closing and reopening writer instances unnecessarily (the logic was initially designed to do so). I changed the logic to maintain a single writer instance in the application scope and reuse it as needed.
The source data came from a DB2 database, and reading from the tables was slow. I added indexes to improve read performance.
The Lucene writer is thread-safe, so I started writing from multiple threads instead of a single thread.
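The "single writer, many threads" arrangement from the points above might look roughly like the following. This is only a sketch under stated assumptions: `SharedWriter` is a hypothetical thread-safe stand-in for the one application-scoped `IndexWriter`, and `ParallelIndexDemo`, `sampleChunks`, and `indexAll` are illustrative names, not part of Lucene or Spring Batch. Each worker handles one chunk, reuses the same writer, and commits once per chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical thread-safe writer; stands in for a single shared IndexWriter. */
class SharedWriter {
    private final AtomicInteger updates = new AtomicInteger();
    private final AtomicInteger commits = new AtomicInteger();

    void updateDocument(String key, String doc) { updates.incrementAndGet(); }
    void commit() { commits.incrementAndGet(); }
    int updates() { return updates.get(); }
    int commits() { return commits.get(); }
}

class ParallelIndexDemo {
    /** Builds `chunkCount` chunks of `chunkSize` dummy documents each. */
    static List<List<String>> sampleChunks(int chunkCount, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int c = 0; c < chunkCount; c++) {
            List<String> chunk = new ArrayList<>();
            for (int i = 0; i < chunkSize; i++) {
                chunk.add("doc-" + c + "-" + i);
            }
            chunks.add(chunk);
        }
        return chunks;
    }

    /** Indexes all chunks on a fixed pool; every task reuses the same writer. */
    static SharedWriter indexAll(List<List<String>> chunks, int threads) {
        SharedWriter writer = new SharedWriter(); // created once, never reopened
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> chunk : chunks) {
            pool.execute(() -> {
                for (String doc : chunk) {
                    writer.updateDocument(doc, doc);
                }
                writer.commit(); // one commit per chunk, not per document
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return writer;
    }

    public static void main(String[] args) {
        SharedWriter writer = indexAll(sampleChunks(4, 250), 4);
        System.out.println(writer.updates() + " updates, " + writer.commits() + " commits");
    }
}
```

The counters here exist only to make the behaviour observable; with a real writer the same structure applies, with the writer closed once when the application shuts down.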
So after increasing the Lucene writer's commit interval, indexing itself doesn't take much time, provided I have enough memory to hold large sets of documents. Reading and preparing documents doesn't take much time either. Lucene can index a few million documents in just a couple of minutes on modern machines.
Upvotes: 3