Stark

Reputation: 2621

Spring Cloud Data Flow for ETL

We are attempting to design an ETL solution by leveraging Spring Cloud Data Flow.

The requirements are mainly to:

  1. Query an external source database (read-only access) that could contain as many as ~400k rows
  2. Perform minimal transformation / data quality checks
  3. Upload/sink to a Postgres data mart using the JDBC sink, truncating the table prior to each daily upload (see the sketch after this list)
  4. Run this every 24 hours
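Whether the pipeline ends up as a stream or a batch job, the sink side boils down to a truncate-then-load against the data mart. Below is a minimal sketch of that step, assuming Spring's JdbcTemplate and placeholder table/column names; in a stream, the jdbc sink app would handle the insert side instead:

    import java.util.List;

    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    // Hypothetical daily reload of the mart table (requirements 3 and 4).
    // Requires @EnableScheduling on a configuration class.
    @Component
    public class DailyMartLoader {

        private final JdbcTemplate martJdbc; // points at the Postgres data mart

        public DailyMartLoader(JdbcTemplate martJdbc) {
            this.martJdbc = martJdbc;
        }

        // Run once every 24 hours, here at 02:00.
        @Scheduled(cron = "0 0 2 * * *")
        public void reload() {
            // Truncate the table prior to upload.
            martJdbc.execute("TRUNCATE TABLE mart_table");

            // The transformed rows would arrive from the pipeline; this
            // placeholder stands in for them.
            List<Object[]> rows = fetchTransformedRows();
            martJdbc.batchUpdate(
                "INSERT INTO mart_table (id, value) VALUES (?, ?)", rows);
        }

        private List<Object[]> fetchTransformedRows() {
            return List.of(); // placeholder for the transformed source rows
        }
    }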

Some challenges we faced:

We prefer not to use Tasks / Batch, as the streams pipeline is much more powerful and cleaner (data can flow through the stream and be transformed). Has anyone successfully used SCDF streams to achieve a similar goal, or is it simply not designed for this use case?

Upvotes: 1

Views: 904

Answers (1)

Sabby Anandan

Reputation: 5651

The requirement described can be solved with both the streaming and batch models in SCDF. And it appears you're already able to relate to both features well. :)

A few options to consider.

For #1, since you don't have a way to convince the database team to add a new column to the existing table, you could build a custom stream/batch app that mimics what the JDBC source does but uses another datastore to keep track of already "seen" or "processed" records. Ultimately, you'd need some form of persistent storage/cache to decide which rows to process and which to skip.
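To make that concrete, here is a rough sketch of such a poller. The processed_ids side table and all other names here are placeholders of my own, not the actual JDBC source app-starter:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    import org.springframework.jdbc.core.JdbcTemplate;

    // Hypothetical polling source that mimics the JDBC source but remembers
    // which rows it has already emitted, using a second (writable) datastore.
    public class TrackingJdbcPoller {

        private final JdbcTemplate sourceJdbc;  // read-only source database
        private final JdbcTemplate trackerJdbc; // writable bookkeeping store

        public TrackingJdbcPoller(JdbcTemplate sourceJdbc, JdbcTemplate trackerJdbc) {
            this.sourceJdbc = sourceJdbc;
            this.trackerJdbc = trackerJdbc;
        }

        // Called on each poll; emits only rows not seen before. A real
        // implementation would page through the source rather than scanning
        // all ~400k rows every time.
        public List<Map<String, Object>> pollNewRows() {
            List<Map<String, Object>> rows =
                sourceJdbc.queryForList("SELECT id, payload FROM source_table");

            return rows.stream()
                .filter(row -> markIfUnseen((Number) row.get("id")))
                .collect(Collectors.toList());
        }

        // Records the id in the tracking table; returns false if it was
        // already there. ON CONFLICT DO NOTHING is Postgres syntax.
        private boolean markIfUnseen(Number id) {
            int inserted = trackerJdbc.update(
                "INSERT INTO processed_ids (id) VALUES (?) ON CONFLICT DO NOTHING",
                id.longValue());
            return inserted == 1;
        }
    }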

Alternatively, you could build a custom source on top of the CDC capabilities supported by the databases. We are attempting to build one using Debezium [see: spring-cloud-stream-app-starters/app-starters-release#173], and there's a POC on this matter at tzolov/cdc-debezium. With that, you'd simply rely on the supported databases' CDC eventing model to receive only the new records from the table automatically.
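Under the hood, that kind of source amounts to something like the sketch below, using Debezium's embedded-engine API (the DebeziumEngine interface of newer releases); all connection properties are placeholders:

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import io.debezium.engine.ChangeEvent;
    import io.debezium.engine.DebeziumEngine;
    import io.debezium.engine.format.Json;

    // Each ChangeEvent is one inserted/updated/deleted row, pushed to the
    // handler as it happens -- no polling, no "seen" bookkeeping of your own.
    public class CdcSourceSketch {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("name", "etl-cdc");
            props.setProperty("connector.class",
                    "io.debezium.connector.postgresql.PostgresConnector");
            // Debezium tracks its own position in the change log via offsets.
            props.setProperty("offset.storage",
                    "org.apache.kafka.connect.storage.FileOffsetBackingStore");
            props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");
            props.setProperty("database.hostname", "source-db");
            props.setProperty("database.port", "5432");
            props.setProperty("database.user", "replicator");
            props.setProperty("database.password", "secret");
            props.setProperty("database.dbname", "sourcedb");
            props.setProperty("topic.prefix", "etl");

            DebeziumEngine<ChangeEvent<String, String>> engine =
                    DebeziumEngine.create(Json.class)
                            .using(props)
                            .notifying(event -> System.out.println(event.value()))
                            .build();

            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine); // the engine implements Runnable
        }
    }

One caveat: CDC generally requires replication-level privileges on the source database, which is worth checking given your read-only access.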

Feel free to try out the POC and comment on issue#173 with your findings/use-cases.

The #2 is more of a design question: if a big result set needs to be processed in one go, does it fit the streaming model? Perhaps you could design it as a task/batch job that processes the records offline, so you'd have better leverage for failure recovery and restartability.
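For illustration, the offline alternative could look like the chunk-oriented step below, sketched with Spring Batch's builder APIs; the record type, SQL, and DataSource wiring are placeholders (the two DataSources would need qualifiers in a real config). Spring Batch commits after every chunk, so a failed run restarts from the last committed chunk rather than re-reading all ~400k rows:

    import javax.sql.DataSource;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.BeanPropertyRowMapper;

    @Configuration
    public class MartLoadJobConfig {

        @Bean
        public Step loadMartStep(StepBuilderFactory steps,
                                 DataSource sourceDb, DataSource martDb) {
            // Stream the source rows through a cursor instead of loading
            // the whole result set into memory.
            JdbcCursorItemReader<SourceRow> reader =
                    new JdbcCursorItemReaderBuilder<SourceRow>()
                            .name("sourceReader")
                            .dataSource(sourceDb)
                            .sql("SELECT id, value FROM source_table")
                            .rowMapper(new BeanPropertyRowMapper<>(SourceRow.class))
                            .build();

            JdbcBatchItemWriter<SourceRow> writer =
                    new JdbcBatchItemWriterBuilder<SourceRow>()
                            .dataSource(martDb)
                            .sql("INSERT INTO mart_table (id, value) VALUES (:id, :value)")
                            .beanMapped()
                            .build();

            return steps.get("loadMartStep")
                    .<SourceRow, SourceRow>chunk(1000) // commit/restart granularity
                    .reader(reader)
                    .writer(writer)
                    .build();
        }

        public static class SourceRow {
            private Long id;
            private String value;
            public Long getId() { return id; }
            public void setId(Long id) { this.id = id; }
            public String getValue() { return value; }
            public void setValue(String value) { this.value = value; }
        }
    }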

Once you have a CDC-like source in place, what you receive will only be the new events/rows anyway, so that might automatically address the flow-control concern.

Upvotes: 1
