Reputation: 819
I have been running into the same problem and don't have a good general solution.
The scenario is this: running a single thread to read all X million rows in a single select/connection always runs into problems. So I would like Pentaho to let me make, say, multiple selects and process 100K or 500K rows per "batch", and keep going until there are no more rows.
I can hard-code a simple script that runs pan.sh with named parameters for a start row and batch size. That works great, but I have to pre-calculate the script steps and the actual starting row numbers.
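For illustration, the hard-coded version looks roughly like this (a sketch only: the path, parameter names and row counts are placeholders, and the transformation is assumed to declare START_ROW and BATCH_SIZE as named parameters and use them in its query):

#!/bin/bash
# Hard-coded batching wrapper: runs the transformation once per batch.
# TOTAL_ROWS has to be pre-calculated by hand, which is the annoying part.
TOTAL_ROWS=3000000
BATCH_SIZE=500000
START_ROW=0
while [ "$START_ROW" -lt "$TOTAL_ROWS" ]; do
    ./pan.sh -file=/path/to/read_batch.ktr \
             -param:START_ROW="$START_ROW" \
             -param:BATCH_SIZE="$BATCH_SIZE" || exit 1
    START_ROW=$((START_ROW + BATCH_SIZE))
done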
Ideally, Pentaho would let me set a "Number of Copies" and a batch size on the Table Input step so it would all be automagic!
Does someone have a sample job definition that gets a row count for a table and then "loops" a call to the transformation until all rows are processed? Maybe some of the batches could run in parallel, for extra credit.
Upvotes: 1
Views: 3534
Reputation: 5164
Ah, you called?
Yes, the way to do that is indeed to use multiple copies of the step together with the step's copy "number" and the mod function.
So, if you have 2 copies of the step and you say:
where rownum % 2 = 0
in the query then you'll pull out every second row.
rownum could be an ID or some other numeric column; it needs to be an evenly spread ID, of course. It also helps if it's indexed, and it's especially good if it is part of the underlying database partitioning.
This approach works really well for slow network connections too: it lets you load up multiple connections to the database.
The variables to use are documented in this JIRA:
http://jira.pentaho.com/browse/PDI-3253
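To make that concrete, here is a rough sketch of what the Table Input query can look like with the internal variables that JIRA describes (assuming Internal.Step.Unique.Count and Internal.Step.Unique.Number are available in your version; big_table and id are placeholders):

-- Table Input query, with "Replace variables in script?" ticked and
-- the step's "Number of copies" set to N.
SELECT *
FROM   big_table
WHERE  MOD(id, ${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}

Each copy then pulls only its own slice of the rows, since (as far as I recall) the copy number runs from 0 to count-1.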
It's then your choice whether you wish to keep the "partition" flowing downstream.
Upvotes: 1