Reputation: 2621
We are attempting to design an ETL solution by leveraging Spring Cloud Data Flow.
The requirements are mainly to:
Some challenges we faced:
We tried using the JDBC-source starter app to connect to the source databases;
however, there are restrictions on the databases we are accessing, as they are owned by another department, so the jdbc.update feature (which updates the source database to mark a row as "seen") could not be used.
Is there a recommended way to query and page through a large resultset without running out of memory? The jdbc.max-rows-per-poll option does not seem suitable in this situation.
We would prefer not to use Tasks / Batch, as the streams pipeline is much more powerful and cleaner (data can flow through the stream and be transformed along the way). Has anyone successfully used SCDF streams to achieve a similar goal, or is it simply not designed for this use case?
Upvotes: 1
Views: 904
Reputation: 5651
The requirement described can be solved with both the Streaming and Batch models in SCDF. And it appears you're already able to relate to both features well. :)
A few options to consider.
For #1, since you don't have a way to convince the database team to add a new column to the existing table, you could build a custom stream/batch app that mimics what the JDBC source does but uses another datastore to keep track of already "seen" or "processed" records. Ultimately, you'd need some form of persistent storage/cache to decide which rows to process and which to skip.
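To make the idea concrete, here is a minimal, dependency-free sketch of that bookkeeping (class and method names are hypothetical; in a real app the Set would be a Redis key or a side table, and the rows would come back from a JdbcTemplate query):

```java
import java.util.*;

// Sketch: mimics the JDBC source's polling, but keeps the "seen" state in
// its own store instead of writing back to the source table.
class SeenTrackingPoller {
    // Stands in for a persistent store (Redis, a side table, etc.).
    private final Set<Long> processed = new HashSet<>();

    // rows: id -> payload, as they would come back from a SELECT on the source DB.
    // Returns only the payloads that have not been emitted before.
    List<String> poll(Map<Long, String> rows) {
        List<String> fresh = new ArrayList<>();
        for (Map.Entry<Long, String> row : rows.entrySet()) {
            if (processed.add(row.getKey())) { // add() returns false if already seen
                fresh.add(row.getValue());
            }
        }
        return fresh;
    }
}
```

Each poll then emits only new rows downstream, without ever mutating the source database.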
Alternatively, you could build a custom source on top of the CDC capabilities supported by the databases. We are attempting to build one using Debezium [see: spring-cloud-stream-app-starters/app-starters-release#173], and there's a POC on this at tzolov/cdc-debezium. With that, you'd simply rely on the supported databases' CDC eventing model to receive only the new records from the table, automatically.
Feel free to try out the POC and comment on issue#173 with your findings/use-cases.
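For context on what such a CDC source emits: Debezium change events carry an op field ("c" for create, "u" for update, "d" for delete) plus before/after images of the row. A dependency-free sketch of how a downstream processor might pick out just the newly inserted rows (the Map stands in for the deserialized event envelope; the class name is hypothetical):

```java
import java.util.*;

// Sketch: consumes Debezium-style change events and extracts new rows.
class CdcHandler {
    // Returns the new row state for insert events, empty otherwise.
    static Optional<Map<String, Object>> newRow(Map<String, Object> event) {
        if ("c".equals(event.get("op"))) { // "c" = create/insert in Debezium's envelope
            @SuppressWarnings("unchecked")
            Map<String, Object> after = (Map<String, Object>) event.get("after");
            return Optional.ofNullable(after);
        }
        return Optional.empty(); // updates/deletes would be handled elsewhere
    }
}
```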
The #2 is more of a design question. If a big resultset needs to be processed in one go, does it fit the streaming model? Perhaps you could design that as a task/batch job to process the records offline, so you'd have better leverage on failure recovery and restartability.
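Whichever model you pick, one common way to page through a big table without exhausting memory is keyset pagination: each page is fetched with something like SELECT ... WHERE id > :lastId ORDER BY id LIMIT :pageSize, so only one page is in memory at a time and a restart can resume from the last id. A dependency-free sketch of the idea (the in-memory Collection stands in for the source table; names are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

// Sketch: keyset pagination over a large id set, one page in memory at a time.
class KeysetPager {
    // Stands in for "SELECT id FROM t WHERE id > ? ORDER BY id LIMIT ?".
    static List<Long> fetchPage(Collection<Long> table, long afterId, int pageSize) {
        return table.stream()
                .filter(id -> id > afterId)
                .sorted()
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    // Drives page after page until the source is exhausted.
    static List<Long> processAll(Collection<Long> table, int pageSize) {
        List<Long> processed = new ArrayList<>();
        long lastId = Long.MIN_VALUE;
        List<Long> page;
        while (!(page = fetchPage(table, lastId, pageSize)).isEmpty()) {
            processed.addAll(page);             // "process" each row here
            lastId = page.get(page.size() - 1); // remember the keyset position
        }
        return processed;
    }
}
```

Persisting lastId (e.g. in the batch job's execution context) is what gives you the restartability mentioned above.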
Once you have a CDC-like source in place, what you receive will only be the new events/rows anyway, so that might automatically address the flow control.
Upvotes: 1