Reputation: 562
We're currently big users of Dataflow batch jobs and want to start using Dataflow streaming if it can be done reliably.
Here is a common scenario: we have a very large Kafka topic on which we need to do some basic ETL or aggregation, and a non-idempotent upstream queue. Here is an example of our Kafka data:
ID   | msg | timestamp (mm:ss)
-----|-----|------------------
1    | A   | 01:00
2    | B   | 01:01
3    | D   | 06:00
4    | E   | 06:01
4.3  | F   | 06:01
...  | ... | ...   (millions more)
4.5  | ZZ  | 19:58
Oops: at some point the IDs change from integers to decimals, which will eventually cause some elements to fail, forcing us to kill the pipeline, possibly modify the downstream service, and possibly make minor code changes to the Dataflow pipeline.
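To make the failure mode concrete, here is a minimal sketch of the kind of parsing step where this would surface. The `ParseRecordFn` name and the integer-ID assumption are mine, not from the original pipeline; the point is just that a record like `4.3` throws and the element cannot be processed.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical parsing step that assumes IDs are integers.
class ParseRecordFn extends DoFn<KV<String, String>, KV<Long, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, String> record = c.element();
    // Long.parseLong("4.3") throws NumberFormatException; in streaming Dataflow
    // a failing bundle is retried indefinitely, effectively stalling the pipeline.
    long id = Long.parseLong(record.getKey());
    c.output(KV.of(id, record.getValue()));
  }
}
```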
In Spark Structured Streaming, because of the ability to use external checkpoints, we would be able to restart a streaming job and resume processing the queue where the previous job left off (the last successfully processed offset), for exactly-once processing. In a vanilla or Spring Boot Java application we could loop through with a Kafka consumer and commit offsets only after writing results to our sink.
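For reference, this is roughly the plain-Java pattern I mean: auto-commit disabled, `commitSync()` called only after the sink write succeeds. The broker address, topic, and group id are placeholders, and `writeToSink` stands in for our real downstream write.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-consumer");            // placeholder
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // never auto-commit
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("our-topic")); // placeholder topic name
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          writeToSink(record); // write first...
        }
        consumer.commitSync(); // ...commit only after everything polled was written
      }
    }
  }

  private static void writeToSink(ConsumerRecord<String, String> record) {
    // stand-in for the real downstream write
  }
}
```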
My overall question is: can we achieve similar functionality in Dataflow? I'll list some of my assumptions and concerns:
Now, in our scenario the pipeline started failing around F; it seems our only choices are to start from the beginning or to lose data? I believe this might be overcome with a lot of custom code (a custom DoFn to ensure the Kafka consumer never commits) plus some custom code for our upstream sink that would eventually commit offsets. Is there a better way to do this, and/or are some of my assumptions wrong about how offset management is handled in Dataflow?
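The closest thing I've found so far is a sketch like the one below, assuming Beam's KafkaIO: disable the client's own auto-commit and use `commitOffsetsInFinalize()`, so offsets are committed back to Kafka only after Dataflow finalizes the checkpoint containing those records. My understanding (which may be wrong) is that a replacement job using the same group id would then resume from the last committed offset, giving at-least-once rather than exactly-once behavior across a restart. Broker address, topic, and group id here are placeholders.

```java
import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaEtlPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline.apply(
        KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")           // placeholder
            .withTopic("our-topic")                        // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // Disable the Kafka client's auto-commit; offsets should only move
            // when the runner finalizes a checkpoint for the read.
            .withConsumerConfigUpdates(
                ImmutableMap.<String, Object>of(
                    ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false,
                    ConsumerConfig.GROUP_ID_CONFIG, "etl-consumer")) // placeholder
            // Commit offsets back to Kafka only after checkpoint finalization,
            // so a new job with the same group.id can resume from roughly
            // where the old one left off.
            .commitOffsetsInFinalize()
            .withoutMetadata());
    // ... parse / aggregate / write to the sink here ...

    pipeline.run();
  }
}
```

Whether this actually covers the "pipeline fails at F" case, or whether something more custom is still needed, is exactly what I'm unsure about.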
Upvotes: 4
Views: 1422
Reputation: 6023
Thank you for the detailed question!
Upvotes: 3