C0deAttack

Reputation: 24667

Best way to index database table data in Solr?

I have a table with around 100,000 rows at the moment. I want to index the data in this table in a Solr index.

So the naive method would be to select all of the rows, build a Solr document from each one, and POST all of the documents to Solr in a single request.

That approach has some advantages, but the main problem I can see is that it is not scalable: as the table grows, so will the memory requirements and the size of the POST request. Perhaps I need to take n rows at a time, process them, then take the next n?
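Something like the following is what I have in mind, as a rough sketch with JDBC and the SolrJ client (the table, column names, connection details, and batch size are placeholders, not recommendations):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchIndexer {
        private static final int BATCH_SIZE = 1000; // n rows per POST

        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
                 Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
                 Statement stmt = db.createStatement()) {

                stmt.setFetchSize(BATCH_SIZE); // hint to the driver to stream rows
                try (ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM items")) {
                    List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
                    while (rs.next()) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", rs.getString("id"));
                        doc.addField("title", rs.getString("title"));
                        doc.addField("body", rs.getString("body"));
                        batch.add(doc);
                        if (batch.size() >= BATCH_SIZE) {
                            solr.add(batch); // one POST per n rows, not one giant POST
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        solr.add(batch);
                    }
                    solr.commit(); // single commit once everything is in
                }
            }
        }
    }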

I'm wondering if anyone has any advice about how best to implement this?

(P.S. I did search the site but didn't find any questions similar to this one.)

Thanks.

Upvotes: 5

Views: 1458

Answers (3)

C0deAttack

Reputation: 24667

I used the suggestion from nikhil500:

DIH does support many transformers. You can also write custom transformers. I will recommend using DIH if possible - I think it will need the least amount of coding and will be faster than POSTing the documents. – nikhil500 Feb 6 at 17:42
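For anyone who lands here, a minimal DIH setup looks roughly like this (the driver, connection details, table, and field names below are placeholders). The handler is registered in solrconfig.xml:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">db-data-config.xml</str>
      </lst>
    </requestHandler>

and the query-to-field mapping lives in the referenced config file, e.g. db-data-config.xml:

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
      <document>
        <entity name="item" query="SELECT id, title, body FROM items">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

A full import can then be triggered with http://localhost:8983/solr/dataimport?command=full-import, and command=delta-import handles incremental updates.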

Upvotes: 1

Jesvin Jose

Reputation: 23078

I once had to upload ~3,000 rows (each with 5 fields) from a database to Solr. I uploaded each document separately and did a single commit at the end. The entire operation took only a few seconds, but 8 of the 3,000 uploads failed.

What worked perfectly was uploading in batches of 50 and committing after each batch. 50 may well have been very conservative; there are recommended limits on how many documents you can upload before doing a commit, and they depend on the size of the documents.
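A rough sketch of that batch-and-commit loop with SolrJ (the method name is made up, and the batch size is whatever you pass in, e.g. 50 as above):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.List;

    public class ChunkedUploader {
        // Upload in fixed-size batches, committing after each one, so a mid-run
        // failure loses at most the current (uncommitted) batch.
        static void upload(SolrClient solr, List<SolrInputDocument> docs, int batchSize) throws Exception {
            for (int from = 0; from < docs.size(); from += batchSize) {
                int to = Math.min(from + batchSize, docs.size());
                solr.add(docs.subList(from, to));
                solr.commit();
            }
        }
    }

Note that each hard commit is relatively expensive, so fewer, larger batches per commit are generally preferable if you can tolerate re-uploading a larger batch on failure.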


But then, this was a one-off operation that you can supervise with a hacked-together script. Will subsequent operations require you to index all 100,000 rows at once, or can you get away with indexing only a few hundred updated documents per run?

Upvotes: 0

nfechner

Reputation: 17525

If you want a balance between POSTing all the documents at once and doing one POST per document, you could collect documents in a queue and run a separate thread that sends them once enough have accumulated. This way you can manage the trade-off between memory use and request time.
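A bare-bones sketch of that queue-plus-sender-thread idea with SolrJ (the queue capacity, batch size, and URL are made up; shutdown and error handling are omitted):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueuedIndexer {
        private static final int BATCH_SIZE = 500;

        // Bounded queue: producers block when it is full, which caps memory use.
        private final BlockingQueue<SolrInputDocument> queue = new LinkedBlockingQueue<>(10000);
        private final SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        public QueuedIndexer() {
            new Thread(this::drain, "solr-sender").start();
        }

        // Producer side: call this for each row as you read it from the database.
        public void enqueue(SolrInputDocument doc) throws InterruptedException {
            queue.put(doc);
        }

        // Consumer side: collects documents and POSTs them once enough have accumulated.
        private void drain() {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            try {
                while (true) {
                    batch.add(queue.take()); // wait for at least one document
                    queue.drainTo(batch, BATCH_SIZE - batch.size());
                    solr.add(batch);
                    batch.clear();
                }
            } catch (Exception e) {
                e.printStackTrace(); // a real implementation needs shutdown and retry logic
            }
        }
    }

For what it's worth, SolrJ's StreamingUpdateSolrServer (ConcurrentUpdateSolrClient in newer versions) implements essentially this pattern out of the box, so it may already cover your needs.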

Upvotes: 2
