Reputation: 1018
I'm trying to figure out what's holding the indexing speed back. I'm extracting text from PDFs and indexing each page separately in Solr, so I can get per-page hit results.
I was committing after every "document". Then I noticed Solr spends loads of time rebuilding the index each time I commit.
Now I use this:
<autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
</autoCommit>
to get a commit every minute (or every 10,000 documents, whichever comes first).
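As far as I understand, the same effect can also be had per request with commitWithin instead of server-side autoCommit. A minimal sketch, assuming SolrJ 3.6+ (HttpSolrServer) and a core at http://localhost:8983/solr; the field names are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book1_page1");                  // placeholder field names
        doc.addField("content", "text extracted from one page");

        // Ask Solr to commit within 60s instead of calling commit() per document;
        // Solr coalesces these requests into far fewer actual commits.
        server.add(doc, 60000);
        server.shutdown();
    }
}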
But then I did the math and found it indexes only around 30 'documents' (pages as Solr docs) per second, i.e. roughly 10 real documents per second. That seems pretty slow compared to other setups.
How could I increase my speed?
Extra info (ask if more is needed):
My documents contain 7 fields (one content field holding the text of the page).
I use SolrJ to add the documents to Solr (see the indexing sketch below).
I'm using the example config, since I have no advanced knowledge of Solr.
PC: Intel Core i7 2600 + 16 GB RAM + SSD (this is a dev machine, not the final server, but it should be pretty fast). Not much of the CPU or RAM is being used.
I get the files from external storage, but that's fast: I can easily read 12 MB/s.
I extract the text using PDFBox (see the per-page extraction sketch below).
It took 390 minutes to build a 650 MB index (455,600 Solr documents).
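Roughly what my indexing code looks like, simplified to a sketch; the URL and field names are placeholders, and there is no explicit commit() since autoCommit handles that:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PageIndexer {
    private final HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr");

    // One Solr document per PDF page; field names are placeholders.
    public void indexPage(String fileName, int pageNo, String pageText)
            throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fileName + "_page" + pageNo);
        doc.addField("content", pageText);   // text extracted with PDFBox
        // ... 5 more fields ...
        server.add(doc);                     // no commit() here; autoCommit handles it
    }
}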
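And the per-page extraction, as a sketch assuming PDFBox 1.x (in 2.x PDFTextStripper moved from org.apache.pdfbox.util to org.apache.pdfbox.text):

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PageExtractor {
    // Returns one string of text per page of the PDF.
    public static String[] extractPages(File pdf) throws IOException {
        PDDocument document = PDDocument.load(pdf);
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            String[] pages = new String[document.getNumberOfPages()];
            for (int p = 1; p <= pages.length; p++) {
                stripper.setStartPage(p);   // limit the stripper to one page
                stripper.setEndPage(p);
                pages[p - 1] = stripper.getText(document);
            }
            return pages;
        } finally {
            document.close();
        }
    }
}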
Upvotes: 1
Views: 372
Reputation: 15791
One aspect is whether your process is multithreaded or not. If it isn't, test with several threads extracting text from the PDFs and then handing the pages over to Solr for indexing; see the sketch below.
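A minimal sketch of that idea, reusing the hypothetical PageExtractor/PageIndexer classes from the question; the thread count is just a starting point to tune, and HttpSolrServer is thread-safe, so one shared instance is fine:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // Directory of PDFs passed as the first argument.
        File[] pdfs = new File(args[0]).listFiles();

        // Roughly one worker per core; tune and measure.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        final PageIndexer indexer = new PageIndexer();

        for (final File pdf : pdfs) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Extract all pages of one PDF, then hand them to Solr.
                        String[] pages = PageExtractor.extractPages(pdf);
                        for (int p = 0; p < pages.length; p++) {
                            indexer.indexPage(pdf.getName(), p + 1, pages[p]);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}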
Upvotes: 1