C0deAttack

Reputation: 24667

Best way to index database table data in Solr?

I have a table with around 100,000 rows at the moment. I want to index the data in this table in a Solr index.

So the naive method would be to select all of the rows, build a Solr document from each one, and POST all of the documents to Solr in a single request.

That approach has some advantages, but the main problem I can see is that it is not scalable: as the table grows, so will the memory requirements and the size of the POST request. Perhaps I need to take n rows at a time, process them, then take the next n?
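Something like the following is what I have in mind, as a rough sketch with JDBC and the SolrJ client (the table, column names, connection details, and batch size are placeholders, not recommendations):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchIndexer {
        private static final int BATCH_SIZE = 1000; // n rows per POST

        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
                 Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
                 Statement stmt = db.createStatement()) {

                stmt.setFetchSize(BATCH_SIZE); // hint to the driver to stream rows
                try (ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM items")) {
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    while (rs.next()) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", rs.getString("id"));
                        doc.addField("title", rs.getString("title"));
                        doc.addField("body", rs.getString("body"));
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            solr.add(batch); // one POST per n rows, not one giant POST
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        solr.add(batch);
                    }
                    solr.commit(); // single commit once everything is in
                }
            }
        }
    }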

I'm wondering if anyone has any advice about how best to implement this?

(P.S. I did search the site but didn't find any questions similar to this one.)

Thanks.

Upvotes: 5

Views: 1458

Answers (3)

C0deAttack

Reputation: 24667

I used the suggestion from nikhil500:

DIH does support many transformers. You can also write custom transformers. I will recommend using DIH if possible - I think it will need the least amount of coding and will be faster than POSTing the documents. – nikhil500 Feb 6 at 17:42
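For anyone who lands here, a minimal DIH setup looks roughly like this (the driver, connection details, table, and field names below are placeholders). The handler is registered in solrconfig.xml:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">db-data-config.xml</str>
      </lst>
    </requestHandler>

and the query-to-field mapping lives in the referenced config file, e.g. db-data-config.xml:

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
      <document>
        <entity name="item" query="SELECT id, title, body FROM items">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

A full import can then be triggered with http://localhost:8983/solr/dataimport?command=full-import, and command=delta-import handles incremental updates.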

Upvotes: 1

Jesvin Jose

Reputation: 23078

I once had to upload ~3,000 rows (each with 5 fields) from a database to Solr. I uploaded each document separately and did a single commit at the end. The entire operation took only a few seconds, but 8 of the 3,000 uploads failed.

What worked perfectly was uploading in batches of 50 and committing after each batch. 50 may well have been very conservative; there are recommended limits on how many documents you can upload before doing a commit, and they depend on the size of the documents.
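A rough sketch of that batch-and-commit loop with SolrJ (the method name is made up, and the batch size is whatever you pass in, e.g. 50 as above):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.List;

    public class ChunkedUploader {
        // Upload in fixed-size batches, committing after each one, so a mid-run
        // failure loses at most the current (uncommitted) batch.
        static void upload(SolrClient solr, List<SolrInputDocument> docs, int batchSize) throws Exception {
            for (int from = 0; from < docs.size(); from += batchSize) {
                int to = Math.min(from + batchSize, docs.size());
                solr.add(docs.subList(from, to));
                solr.commit();
            }
        }
    }

Note that each hard commit is relatively expensive, so fewer, larger batches per commit are generally preferable if you can tolerate re-uploading a larger batch on failure.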


But then, this was a one-off operation that you can supervise with a hacked-together script. Will subsequent operations require you to index all 100,000 rows at once, or can you get away with indexing only a few hundred updated documents per run?

Upvotes: 0

nfechner

Reputation: 17525

If you want a balance between POSTing all the documents at once and doing one POST per document, you could collect documents in a queue and run a separate thread that sends them once enough have accumulated. This way you can manage the trade-off between memory use and request time.
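A bare-bones sketch of that queue-plus-sender-thread idea with SolrJ (the queue capacity, batch size, and URL are made up; shutdown and error handling are omitted):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueuedIndexer {
        private static final int BATCH_SIZE = 500;

        // Bounded queue: producers block when it is full, which caps memory use.
        private final BlockingQueue<SolrInputDocument> queue = new LinkedBlockingQueue<>(10000);
        private final SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        public QueuedIndexer() {
            new Thread(this::drain, "solr-sender").start();
        }

        // Producer side: call this for each row as you read it from the database.
        public void enqueue(SolrInputDocument doc) throws InterruptedException {
            queue.put(doc);
        }

        // Consumer side: collects documents and POSTs them once enough have accumulated.
        private void drain() {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            try {
                while (true) {
                    batch.add(queue.take()); // wait for at least one document
                    queue.drainTo(batch, BATCH_SIZE - batch.size());
                    solr.add(batch);
                    batch.clear();
                }
            } catch (Exception e) {
                e.printStackTrace(); // a real implementation needs shutdown and retry logic
            }
        }
    }

For what it's worth, SolrJ's StreamingUpdateSolrServer (ConcurrentUpdateSolrClient in newer versions) implements essentially this pattern out of the box, so it may already cover your needs.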

Upvotes: 2
