imalik8088

Reputation: 1631

Continuous deployment for a stateful Apache Flink application on Kubernetes

I want to run an Apache Flink (1.11.1) streaming application on Kubernetes, using a filesystem state backend that saves to S3. Checkpointing to S3 is working:

args:
  - "standalone-job"
  - "-s"
  - "s3://BUCKET_NAME/34619f2862ce3e5fc91d80eae13a434a/chk-4/_metadata"
  - "--job-classname"
  - "com.abc.def.MY_JOB"
  - "--kafka-broker"
  - "KAFKA_HOST:9092"

So the problem that I'm facing is:

Upvotes: 3

Views: 1126

Answers (2)

mingtao wang

Reputation: 115

There are several ways to deploy workloads to Kubernetes: plain YAML files, a Helm chart, or an operator.

Upgrading a stateful Flink job is not as simple as upgrading a stateless service, where you only need to update the binary and restart.

To upgrade a Flink job you need to take a savepoint (or find the latest checkpoint directory), then update the binary, and finally resubmit the job. Plain YAML files and Helm charts cannot orchestrate this for you, so you should consider using a Flink operator to handle the upgrade, for example:

https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
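
For illustration, a job deployed through that operator is described by a FlinkCluster resource along these lines; the field names are taken from the operator's example manifests and may vary between versions, and the image tag, jar path, and bucket name are placeholders:

apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: my-flink-job
spec:
  image:
    name: flink:1.11.1          # placeholder image tag
  taskManager:
    replicas: 2
  job:
    jarFile: /opt/flink/usrlib/my-job.jar   # placeholder path to the application jar
    className: com.abc.def.MY_JOB
    args: ["--kafka-broker", "KAFKA_HOST:9092"]
    savepointsDir: s3://BUCKET_NAME/savepoints   # operator takes savepoints here for upgrades
    autoSavepointSeconds: 300                    # periodic automatic savepoints
  flinkProperties:
    state.backend: filesystem
    state.checkpoints.dir: s3://BUCKET_NAME/checkpoints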

Upvotes: 0

David Anderson

Reputation: 43524

  • You might be happier with Ververica Platform: Community Edition, which raises the level of abstraction to the point where you don't have to deal with the details at this level. It has an API that was designed with CI/CD in mind.
  • I'm not sure I understand your second point, but it's normal that your job will rewind and reprocess some data during recovery. Flink does not guarantee exactly once processing, but rather exactly once semantics: each event will affect the state being managed by Flink exactly once. This is done by rolling back to the offsets in the most recent checkpoint, and rolling back all of the other state to what it had been after consuming all of the data up to those offsets.
  • Having a state backend is necessary as a place to store your job's working state while the job is running. If you don't enable checkpointing, then the working state won't be checkpointed and cannot be recovered. However, as of Flink 1.11, you can enable checkpointing via the config file, using:
execution.checkpointing.interval: 60000
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION

Upvotes: 2
