Reputation: 1539
I have a Flink stream processing application that reads a stream of messages from a Pulsar topic, processes them, and stores the output files in S3. It performs the operations below.
The happy path works very well. Problems start with partial failures and recovery.
Step #2 can create multiple different streams, since there will be different keys in the stream.
Checkpoint 1 is triggered.

- Stream 1 (Key 1): processing succeeds.
- Stream 2 (Key 2): processing fails for some reason at step 3 or 4 above.

Checkpoint 2 completes.
If I throw an exception when stream 2 fails, it fails the whole job, which then restarts from checkpoint 1. In that case, stream 1 is also reprocessed, which should not happen.
Is there a way in Flink to manually avoid acknowledging the Pulsar topic for only the failed messages, or to process only the failed records after a restart? My requirement is to avoid duplicate processing and reprocess only the failed records.
I read that a savepoint could be one solution, but did not find a concrete example.
Appreciate your help!!
Upvotes: 0
Views: 224
Reputation: 43454
Flink only allows for partial restarts in cases where doing so doesn't compromise correctness -- i.e., in streaming pipelines without any repartitioning. The keyBy
in your use case makes this unworkable.
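To see why repartitioning rules this out, here is a minimal sketch in plain Java (no Flink dependencies). `subtaskFor` is a hypothetical, simplified stand-in for Flink's real key-group hashing; the point is only that a `keyBy` spreads keys across parallel subtasks, so a failure in one keyed stream cannot be isolated to one part of the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyGroupSketch {
    // Simplified stand-in for Flink's key -> key-group -> subtask assignment.
    // Real Flink murmur-hashes keys over maxParallelism key groups; the exact
    // function does not matter here, only that keys fan out across subtasks.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        Map<Integer, List<String>> assignment = new TreeMap<>();
        for (String key : List.of("Key1", "Key2", "Key3")) {
            assignment.computeIfAbsent(subtaskFor(key, parallelism),
                    s -> new ArrayList<>()).add(key);
        }
        // Keys end up on different subtasks, and every subtask's state is tied
        // to the same checkpoint. A failure therefore rewinds all of them
        // together; there is no per-key restart point.
        System.out.println(assignment);
    }
}
```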
Upvotes: 0
Reputation: 9245
The short answer is "no". Flink tracks source offsets and sink transactions (plus operator state) in order to support efficient exactly-once processing.
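The consequence for your scenario can be sketched in plain Java (no Flink dependencies; `Msg` and `replayFrom` are hypothetical names modeling the behavior, not Flink API). Flink checkpoints a source offset, and on failure every subtask rewinds to it, so records for all keys after the checkpointed offset are replayed, including Key 1's already-successful ones:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplaySketch {
    // Hypothetical source record: (offset in the topic, key).
    record Msg(long offset, String key) {}

    // Simplified model of restore-and-replay: recovery rewinds the source to
    // the last checkpointed offset. Replay is by offset, not by key, so it
    // cannot skip records whose key happened to be processed successfully.
    static List<Msg> replayFrom(long checkpointedOffset, List<Msg> log) {
        List<Msg> reprocessed = new ArrayList<>();
        for (Msg m : log) {
            if (m.offset() > checkpointedOffset) {
                reprocessed.add(m);
            }
        }
        return reprocessed;
    }

    public static void main(String[] args) {
        List<Msg> log = List.of(
                new Msg(1, "Key1"), new Msg(2, "Key2"),
                new Msg(3, "Key1"), new Msg(4, "Key2"));
        // Checkpoint 1 covered offsets <= 2; the job failed before checkpoint 2,
        // so offsets 3 and 4 come back for BOTH keys after the restart.
        System.out.println(replayFrom(2, log));
    }
}
```

This is why the usual remedies are on the sink side (idempotent writes, or transactional sinks that discard uncommitted output) rather than selective acknowledgment at the source.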
Upvotes: 0