malsmith

Reputation: 819

Right way to transfer millions of rows from source database table to destination database table

I keep running into the same problem and don't have a good general solution -

The scenario is: a Pentaho (PDI) transformation reads X million rows from a source database table with a Table Input step and writes them to a destination database table.

The issue is that reading all X million rows with a single select on a single thread/connection always runs into problems. So I would like Pentaho to let me issue multiple selects and process, say, 100K or 500K rows per "batch", and keep processing until there are no more rows.

I can hard-code a simple script to run pan.sh with named parameters for a start row and a batch size - that works great, but I have to pre-calculate the script steps and the actual starting row numbers.

Ideally, Pentaho would let me set a "Number of Copies" and a batch size on the Table Input step so it would be automagic!

Does someone have a sample job definition that gets a row count for a table and then "loops" a call to the transformation until all rows are processed? Maybe some of the batches could run in parallel for extra credit.
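
For concreteness, the hard-coded wrapper I described above looks roughly like this sketch (the parameter names START_ROW and BATCH_SIZE, the .ktr path, and the row count are just placeholders I picked for illustration):

    #!/bin/sh
    # Drive the transformation in batches by passing named parameters to pan.sh.
    TRANSFORMATION=/path/to/copy_table.ktr   # placeholder transformation file
    BATCH_SIZE=500000
    TOTAL_ROWS=3000000                       # pre-calculated by hand today

    START_ROW=0
    while [ "$START_ROW" -lt "$TOTAL_ROWS" ]; do
        ./pan.sh -file="$TRANSFORMATION" \
                 -param:START_ROW="$START_ROW" \
                 -param:BATCH_SIZE="$BATCH_SIZE" || exit 1
        START_ROW=$((START_ROW + BATCH_SIZE))
    done

This works, but I still have to supply TOTAL_ROWS by hand, which is the part I'd like the job to figure out for itself.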

Upvotes: 1

Views: 3534

Answers (1)

Codek

Reputation: 5164

Ah, you called?

Yes; the way to do that is indeed to use multiple copies of the step and then to use the step copy "number" together with a mod function.

So, if you have 2 copies of the step and you say:

where rownum % 2 = 0

in the query, then that copy will pull out every second row (and the other copy takes % 2 = 1).

rownum could be an ID or some other numeric column - it needs to be an evenly spread ID, of course. It also helps if it's indexed, and it's especially good if it is part of the underlying database partitioning.

This approach also works really well over slow network connections - it lets you load up multiple connections to the database.

Doco on the variables to use is in this JIRA:

http://jira.pentaho.com/browse/PDI-3253
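
To sketch what that looks like (assuming the variables documented there are Internal.Step.Unique.Count and Internal.Step.Unique.Number, and using placeholder table/column names; you also need "Replace variables in script?" ticked on the Table Input step):

    -- Each step copy pulls its own slice of the table.
    SELECT *
    FROM my_table
    WHERE MOD(id, ${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}

With 4 copies of the step, copy 0 reads the rows where id mod 4 = 0, copy 1 the rows where id mod 4 = 1, and so on.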

It's then your choice whether you wish to keep the "partition" flowing downstream.

Upvotes: 1
