Reputation: 819
I have been running into the same problem and don't have a good general solution.
The scenario is this: running a single thread to read all X million rows in a single select/connection always runs into problems. So I would like Pentaho to let me make, say, multiple selects and process 100K or 500K rows per "batch", and keep going until there are no more rows.
I can hard-code a simple script that runs pan.sh with named parameters for a start row and batch size. That works great, but I have to pre-calculate the script steps and the actual starting row numbers.
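For illustration, the hard-coded version looks roughly like this (a sketch only: the path, parameter names and row counts are placeholders, and the transformation is assumed to declare START_ROW and BATCH_SIZE as named parameters and use them in its query):

#!/bin/bash
# Hard-coded batching wrapper: runs the transformation once per batch.
# TOTAL_ROWS has to be pre-calculated by hand, which is the annoying part.
TOTAL_ROWS=3000000
BATCH_SIZE=500000
START_ROW=0
while [ "$START_ROW" -lt "$TOTAL_ROWS" ]; do
    ./pan.sh -file=/path/to/read_batch.ktr \
             -param:START_ROW="$START_ROW" \
             -param:BATCH_SIZE="$BATCH_SIZE" || exit 1
    START_ROW=$((START_ROW + BATCH_SIZE))
done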
Ideally, Pentaho would let me set a "Number of Copies" and a batch size on the Table Input step so it would all be automagic!
Does someone have a sample job definition that gets a row count for a table and then "loops" a call to the transformation until all rows are processed? Maybe some of the batches could run in parallel, for extra credit.
Upvotes: 1
Views: 3534
Reputation: 5164
Ah, you called?
Yes, the way to do that is indeed to use multiple copies of the step together with the step's copy "number" and the mod function.
So, if you have 2 copies of the step and you say:
where rownum % 2 = 0
in the query then you'll pull out every second row.
rownum could be an ID or some other numeric column; it needs to be an evenly spread ID, of course. It also helps if it's indexed, and it's especially good if it is part of the underlying database partitioning.
This approach works really well for slow network connections too: it lets you load up multiple connections to the database.
The variables to use are documented in this JIRA:
http://jira.pentaho.com/browse/PDI-3253
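To make that concrete, here is a rough sketch of what the Table Input query can look like with the internal variables that JIRA describes (assuming Internal.Step.Unique.Count and Internal.Step.Unique.Number are available in your version; big_table and id are placeholders):

-- Table Input query, with "Replace variables in script?" ticked and
-- the step's "Number of copies" set to N.
SELECT *
FROM   big_table
WHERE  MOD(id, ${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}

Each copy then pulls only its own slice of the rows, since (as far as I recall) the copy number runs from 0 to count-1.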
It's then your choice whether you wish to keep the "partition" flowing downstream.
Upvotes: 1