jmhg92

Reputation: 131

Parallel processing large SQL table with Camel

I am trying to process about 7 million rows daily from an Informix table with Apache Camel, but I can't figure out how to accomplish it.

My first attempt, which worked with a very small data set (about 50k rows), used .split(body()).parallelProcessing() like so:

from("quartz2://transaccionesGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "queryData").split(body()).parallelProcessing() // Essentially executes a query on my table and returns a list of MyTable.class
.bean(ProcessTable.class, "processData") // Converts each MyTable object into another type of object (NewData.class) for later processing, storing in them in a synchronized list
.end().to("direct:transform-data");

from("direct:transform-data")
.bean(ProcessNewData.class, "processNewData").split(body()).parallelProcessing() // Obtains list
.bean(AnalyzeData.class, "analyze") // Analyzes the data
.bean(PersistData.class, "persist") // Persists the new data on other tables
.end();

This of course resulted in an "OutOfMemory" error when I tried it with 500k rows on .bean(QueryTable.class, "queryData").split(body()).parallelProcessing(), because it first tried to cache all of the data from the query before parsing it. I tried setting fetchSize to something like 100 but got the same error, and using maxRows would only get me the number of rows I specified, ignoring the rest.

My next attempt was to use one of Camel's components, like sql-component or jdbc, with a Splitter to process each row in a separate thread, but I ran into the exact same problem.

sql:

from("quartz2://transaccionesGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "queryDataParams") // Gets the params for my query
.to("sql:SELECT * FROM my_table WHERE date_received BETWEEN :#startDate AND :#endDate?dataSource=dataSourceInformix").split(body()).parallelProcessing()
// The rest would be essentially the same

jdbc:

from("quartz2://transaccionesGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "queryString") // Gets the query to execute
.to("jdbc:dataSourceInformix").split(body()).parallelProcessing()

My last attempt was to use maxMessagesPerPoll for the sql component and outputType=StreamList for the jdbc component, but unfortunately the former only processes one row at a time (and it has to be used as a consumer) and the latter gives me a java.sql.SQLException: Cursor not open exception.

sql:

from("sql:" + query +"?dataSource=dataSourceInformix&maxMessagesPerPoll=100") // I need to be able to use the quartz2 component

jdbc:

.to("jdbc:dataSourceInformix?outputType=StreamList").split(body()).streaming() // Throws exception

The end goal is to be able to process millions of rows without consuming so much memory, so as to prevent the "OutOfMemory" error. My idea, if possible, is to do the following:

  1. Create my query on a Quartz cron trigger
  2. Obtain and group N results
  3. Send a group of results to be processed (in another thread) while another group is being obtained
  4. Repeat until all the data has been processed
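As an illustration only (outside Camel, with made-up names, not a proposed final implementation), steps 2-4 amount to a batch-and-dispatch loop like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Hypothetical sketch of steps 2-4: collect rows into batches of N and hand
// each full batch to a worker pool, so the next batch can be gathered while
// the previous one is still being processed in another thread.
class BatchedPipeline {
    static int runBatches(List<Object> rows, int batchSize,
                          ExecutorService workers, Consumer<List<Object>> process) {
        int dispatched = 0;
        List<Object> batch = new ArrayList<>(batchSize);
        for (Object row : rows) {
            batch.add(row);
            if (batch.size() == batchSize) {                // step 2: group N results
                final List<Object> full = batch;
                workers.submit(() -> process.accept(full)); // step 3: process in another thread
                dispatched++;
                batch = new ArrayList<>(batchSize);         // step 4: keep gathering
            }
        }
        if (!batch.isEmpty()) {                             // flush the final partial batch
            final List<Object> rest = batch;
            workers.submit(() -> process.accept(rest));
            dispatched++;
        }
        return dispatched;
    }
}
```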

I know this question is similar to this one, but the answer doesn't really help my situation. I also noticed that the documentation for the sql component lists an outputType=StreamList option for the producer, but it was introduced in version 2.18 while I am on 2.14.1.

Any help and tips would be extremely helpful!

Thanks.

Some other info: Apache Camel Version: 2.14.1 Database: Informix

Upvotes: 3

Views: 3724

Answers (1)

jmhg92

Reputation: 131

After quite a bit of research, some more trial and error, and the tips from NotaJD, I found a solution that could work (still testing). Actually, make that two solutions, but they differ only in their type of execution.

Info:

For the sake of explanation I will use the following info:

  • Table has 7 million records (rows)
  • AggregationStrategyImpl extends AggregationStrategy with the following:
    • Returns a List<Object> in the exchange body
    • The aggregation Predicate completes when the List<Object> size >= 50000
    • The aggregation timeout is set to 30000 milliseconds
  • CustomThreadPool is a pseudo-implementation of Camel's ThreadPoolBuilder class:
    • PoolSize: 100
    • MaxPoolSize: 50000
    • MaxQueueSize: 500
    • TimeUnit: MILLISECONDS
    • KeepAliveTime: 30000
  • Both implementations are being autowired

Solution 1:

from("quartz2://myGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "createQuery")

The code will still run on the Quartz cron timer (00:01 every day), but this time my QueryTable.class will fetch the proper query to execute (instead of SELECT *, I now specify the columns I need) and set it as the exchange body.

.to("jdbc:dataSourceInformix?resetAutoCommit=false&outputType=StreamList").split(body()).streaming()
.bean(TransformRecord.class, "process")

The Camel jdbc component will take the query from the exchange body, set resetAutoCommit to false so it doesn't throw the Cursor not open error, set the output to streaming, and split-stream the execution. That way I am not querying all the records at once, but instead one by one. Every record fetched is then transformed into a proper POJO via TransformRecord.class.

.aggregate(constant(true), aggregationStrategyImpl)
.completionPredicate(aggregationStrategyImpl.getCompletionPredicate())
.completionTimeout(aggregationStrategyImpl.getCompletionTimeout())
.to("direct:start-processing")
.end();

This time I use the Aggregate EIP to create a list of records. The aggregationStrategyImpl contains the aggregation logic as well as the completion predicate and timeout, so when I reach a certain number of records (or the timeout occurs) the list is sent to "direct:start-processing".
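A minimal sketch of that aggregation logic, stripped of Camel's Exchange/AggregationStrategy API (class and method names here are illustrative, not my actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the AggregationStrategyImpl logic, minus Camel's
// Exchange API: append each incoming record to the current batch, and report
// completion once the batch reaches the configured size (50000 above).
class BatchAggregator {
    private final int completionSize;
    private List<Object> batch = new ArrayList<>();

    BatchAggregator(int completionSize) { this.completionSize = completionSize; }

    // Mirrors AggregationStrategy.aggregate(): merge the new record into the batch.
    List<Object> aggregate(Object newRecord) {
        batch.add(newRecord);
        return batch;
    }

    // Mirrors the completion Predicate: true once the batch is large enough.
    boolean isComplete() { return batch.size() >= completionSize; }

    // Hand over the full batch (as the aggregate EIP does on completion) and start fresh.
    List<Object> drain() {
        List<Object> full = batch;
        batch = new ArrayList<>();
        return full;
    }
}
```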

More on the aggregation implementation in this Source Allies blog and in the Apache Camel Aggregate EIP docs.

from("direct:start-processing")
.split(body()).executorService(customThreadPool.build(getContext()))
.bean(AnalyzeData.class, "analyze")
.bean(PersistData.class, "persist")
.end();

Here I split the list obtained and, using a custom ThreadPool, create N threads to analyze and process each record individually. This way I can process the list in parallel instead of one by one. I could have used .split(body()).parallelProcessing(), but the default ThreadPool settings might not be optimal later on.
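For reference, the CustomThreadPool settings from the info section correspond to a plain-JDK executor roughly like this (a sketch of the equivalent, not the Camel ThreadPoolBuilder code itself):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Plain-JDK equivalent of the CustomThreadPool settings listed above; Camel's
// ThreadPoolBuilder ultimately builds a ThreadPoolExecutor much like this one.
// Note that threads beyond the core 100 are only created once the 500-slot
// queue is full, since that is how ThreadPoolExecutor grows its pool.
class CustomThreadPoolSketch {
    static ThreadPoolExecutor build() {
        return new ThreadPoolExecutor(
                100,                              // PoolSize (core threads)
                50000,                            // MaxPoolSize
                30000, TimeUnit.MILLISECONDS,     // KeepAliveTime + TimeUnit
                new LinkedBlockingQueue<>(500));  // MaxQueueSize
    }
}
```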

More on the ThreadPool implementation on the Apache Camel Threading Model docs, the ThreadPool Configuration notes and the Red Hat Threading Model docs.

Solution 2:

This solution runs basically the exact same way, but with the following changes:

// .to("direct:start-processing")
.to("seda:start-processing?size=1&blockWhenFull=true")
.end();
// from("direct:start-processing")
from("seda:start-processing?size=1&blockWhenFull=true")
// continues normally

What this does is send the list to be processed asynchronously, allowing up to one other list to be queued in memory and pausing the parent thread if the queue is full. So instead of waiting for the list of records to be processed, the parent thread goes back and collects another batch of records. This also means that if the processing route hasn't finished, the new records won't be thrown away; the parent thread will wait until it can send the batch to the SEDA in-memory queue.
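The blocking behaviour of seda:start-processing?size=1&blockWhenFull=true can be pictured with a bounded JDK queue (just an analogy; SEDA's internals differ):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Analogy for the SEDA endpoint with size=1&blockWhenFull=true: a bounded
// queue holding at most one batch. While the consumer hasn't taken the queued
// batch, offer() fails, and a blocking put() would pause the parent route.
class SedaAnalogy {
    static BlockingQueue<Object> newQueue() {
        return new ArrayBlockingQueue<>(1); // size=1
    }
}
```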

More on the SEDA component in the Apache Camel SEDA component docs on GitHub and on their site.

Conclusions:

With solution 1 it should take much longer to complete, since it processes all the data before gathering more records from the query, but the memory consumption should be FAR less since it is bounded by the aggregation predicate.

With solution 2 it should be much faster, since it gathers the next batch of records from the query while processing the previous batch, but the memory consumption will be FAR greater since it holds at most 3 lists: the one being processed, the one in the SEDA queue, and the latest batch gathered by the parent thread (paused when the queue is full).

I stated that I am still testing these solutions because with 500k records they work, but I'm still working out the optimal ThreadPool settings for the server where this will be deployed. I have researched threading in Java, but it seems there really isn't much to go on other than the system's architecture, RAM, and trial and error.

Upvotes: 6
