Gabriel D

Reputation: 1025

Programmatically determining when Flink Stateful Functions has fully processed a batch of Kafka events

We have a manually-executed migration program that publishes a burst of Kafka events. Other programs also publish a steady trickle of events on the same Kafka topic. We have a Flink StateFun cluster that ingests records from the Kafka topic and delivers them to our stateful functions. Our functions sometimes send messages to another function address. The ultimate output is many rows inserted into Big Query directly from our functions (no egress).

How can we automate determining when StateFun is done processing all the events from the migration program as well as the subsequent function-to-function messages?

Solutions we have considered

Have a developer monitor the logs

Currently, when we want to know whether the burst of Kafka events from the migration program has been fully processed (including the function-to-function messages), a developer watches the function logs until the volume of entries being written tapers off. This is manual and imprecise: the steady trickle of events from other producers means the logs never go fully quiet, so "the burst looks done" is a judgment call rather than a signal we can automate.

Repeatedly compare the Big Query data to the source system

We could repeatedly compare the data in the source system to the BigQuery output and declare the migration done when they match. But this comparison is very slow, and bugs may cause the two systems to never converge, so the loop might never terminate. Worse, one of the reasons we want to know when the migration is done is precisely so we can kick off this comparison/QA process.
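The polling itself is mechanical; a minimal sketch, with hypothetical `fetch_source_counts` / `fetch_bigquery_counts` callables standing in for the real (expensive) queries:

```python
import time

def wait_until_in_sync(fetch_source_counts, fetch_bigquery_counts,
                       poll_seconds=300.0, timeout_seconds=4 * 3600.0):
    """Repeatedly compare two snapshots (e.g. per-table row counts)
    until they agree.

    Returns True once the snapshots match, False on timeout -- which,
    per the drawback above, may simply mean a bug has left the two
    systems permanently out of sync.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if fetch_source_counts() == fetch_bigquery_counts():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Note that each iteration pays the full comparison cost, which is exactly why this approach is slow.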

Publish an "end of migration" event

One suggestion we received was to have the migration program publish an "end of migration" event after publishing all the other events. But the events go to millions of different function addresses, and Kafka only guarantees ordering within a partition, so I don't see any guarantee that the "end of migration" event would be processed after all the other events, let alone after the function-to-function messages they trigger (which are not tied to Kafka offsets at all).
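A partial mitigation, covering only the Kafka-ordering half of the problem and not the in-flight function-to-function messages, would be to publish one sentinel per partition, since Kafka only orders records within a partition. A hypothetical sketch of building such sentinels (the record shape and field names are illustrative, not from the question):

```python
import json

def end_of_migration_sentinels(num_partitions, migration_id):
    """Build one end-of-migration record per partition.

    A single sentinel would only trail the migration events that landed
    on its own partition; emitting one per partition (produced with an
    explicit partition number rather than a key) covers ingestion order
    on every partition. Function-to-function messages triggered by
    earlier records are still NOT covered.
    """
    return [
        {
            "partition": p,
            "value": json.dumps({"type": "end_of_migration",
                                 "migration_id": migration_id}),
        }
        for p in range(num_partitions)
    ]
```

StateFun would then need to count sentinels seen before concluding the Kafka backlog is drained.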

Use the Kafka consumer offset from Flink metrics

After the migration program finishes publishing, we could capture each partition's end offset with /opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell. Then we could watch the committedOffsets metric from the Flink Metrics REST API to determine when StateFun's Kafka consumer has passed those offsets. This seems like a cleaner variant of the "end of migration" solution above, but it shares the key drawback: reaching the captured offsets only proves the Kafka records have been ingested, not that the function-to-function messages they triggered have been processed.
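The offset comparison itself is simple. Assuming the per-partition end offsets captured from GetOffsetShell and the per-partition committedOffsets scraped from the Flink metrics REST API are already in hand as dicts (the fetching code is omitted; both tools report "next" offsets, so a `>=` comparison is consistent), the check could be:

```python
def burst_fully_ingested(end_offsets, committed_offsets):
    """True once every partition's committed offset has reached the end
    offset captured when the migration program finished publishing.

    This only proves Kafka ingestion is done; function-to-function
    messages triggered by those records may still be in flight.
    """
    return all(
        committed_offsets.get(partition, -1) >= target
        for partition, target in end_offsets.items()
    )
```

A partition missing from the committed-offsets snapshot is treated as not yet consumed.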

An awesome solution

This is what I want! But I'd settle for confirmation that the above solutions are the best we'll reasonably get.

Upvotes: 1

Views: 90

Answers (0)
