Gabriela M

Reputation: 615

Incremental upload/update to PostgreSQL table using Pentaho DI

I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.

The first time I run this flow everything goes ok and the table gets populated. However, if later I want to do an incremental update on the same table, I need to truncate it and run the flow again. Is there any method that allows me to only load new/updated rows?

In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.

See my implementation:

[screenshot of the Pentaho transformation]

Thanks in advance!!

Upvotes: 0

Views: 1638

Answers (2)

Kullayappa M

Reputation: 174

You are looking for an incremental load. You can do it in two ways:

  1. Use the "Insert/Update" step, which is meant for incremental loads. It lets you specify key columns to compare; then, under the Fields section, select "Y" (update) for the columns that should be updated and "N" for the columns used in the key comparison.

  2. Use the Table output step with the "Truncate table" option unchecked. When retrieving data from a source table, use a variable in the WHERE clause: first get the max value from your target table, set it into a variable, and include that variable in the WHERE clause of your query.

Edit: if your data source is a flat file, then, as above, get the max value (date/int) from the target table, join it with your data, and then use a Filter rows step to keep only the incremental rows.
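The max-value-plus-filter logic described above can be sketched outside Pentaho. A minimal Python illustration with in-memory data (the `id` key and the row values are hypothetical examples, not from the question):

```python
# Sketch of an incremental load: keep only source rows whose key is
# greater than the current max in the target, then append them.
# In Pentaho this corresponds to "get max value into a variable" plus
# a WHERE clause (or a Filter rows step for flat-file sources).

def incremental_rows(source_rows, target_rows, key="id"):
    """Return the source rows that are newer than anything in the target."""
    max_key = max((row[key] for row in target_rows), default=None)
    if max_key is None:
        return list(source_rows)  # empty target: full load
    return [row for row in source_rows if row[key] > max_key]

target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]

new_rows = incremental_rows(source, target)
target.extend(new_rows)  # only the new row is loaded
```

This avoids re-inserting rows the target already has, which is the whole point of the incremental approach.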

Hope this helps.

Upvotes: 0

Cristian Curti

Reputation: 1054

Looking around for possibilities, some users say that the only advantage of the Bulk Loader is performance with very large batches of rows (upwards of millions). But there are ways of countering this.

Try using the Table output step with a batch size ("Commit size" in the step) of 5000, and alter the number of copies executing the step (this depends on the number of cores your processor has) to, say, 4 copies (a dual-core CPU with 2 logical cores each). You can alter the number of copies by right-clicking the step in the GUI and setting the desired number.

This will parallelize the output into 4 groups of inserts, of 5000 rows per 'cycle' each. If this causes memory overload in the JVM, you can easily adapt that and increase the memory available via the PENTAHO_DI_JAVA_OPTIONS option; simply double the amount set on -Xms (minimum) and -Xmx (maximum). Mine is set to "-Xms2048m" "-Xmx4096m".
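To make the row-distribution arithmetic concrete, here is a small Python sketch of what "4 copies with a commit size of 5000" means for the flow of rows (the round-robin split mirrors Pentaho's default row distribution across step copies; the 23,000-row count is an arbitrary example):

```python
from itertools import islice

def batches(rows, commit_size=5000):
    """Yield successive batches of at most commit_size rows,
    mimicking Table output's "Commit size": one commit per batch."""
    it = iter(rows)
    while True:
        batch = list(islice(it, commit_size))
        if not batch:
            return
        yield batch

def round_robin_split(rows, copies=4):
    """Distribute rows across N step copies, as Pentaho does by
    default when a step runs with multiple copies."""
    groups = [[] for _ in range(copies)]
    for i, row in enumerate(rows):
        groups[i % copies].append(row)
    return groups

rows = list(range(23_000))
groups = round_robin_split(rows, copies=4)
# 23,000 rows -> 5,750 per copy -> 2 commits per copy (5000 + 750)
commits_per_copy = [len(list(batches(g, commit_size=5000))) for g in groups]
```

Larger commit sizes mean fewer round-trips to PostgreSQL per copy, which is where most of the Bulk Loader's speed advantage is recovered.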

The only peculiarity I found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have exactly the same layout as the table.

Upvotes: 1
