imalik8088

Reputation: 1631

Continuous deployment for a stateful Apache Flink application on Kubernetes

I want to run an Apache Flink (1.11.1) streaming application on Kubernetes, using a filesystem state backend that saves to S3. Checkpointing to S3 is working:

args:
  - "standalone-job"
  - "-s"
  - "s3://BUCKET_NAME/34619f2862ce3e5fc91d80eae13a434a/chk-4/_metadata"
  - "--job-classname"
  - "com.abc.def.MY_JOB"
  - "--kafka-broker"
  - "KAFKA_HOST:9092"

So the problem that I'm facing is:

Upvotes: 3

Views: 1126

Answers (2)

mingtao wang

Reputation: 115

There are several ways to deploy workloads to Kubernetes: plain YAML files, a Helm chart, or an operator.

Upgrading a stateful Flink job is not as simple as upgrading a stateless service, where you only need to update the binary and restart.

To upgrade a Flink job you need to take a savepoint (or find the latest checkpoint directory), then update the binary, and finally resubmit the job. Plain YAML files and Helm charts cannot orchestrate this for you, so you should consider using a Flink operator to handle the upgrade, for example:

https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
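
For illustration, a job deployed through that operator is described by a FlinkCluster resource along these lines; the field names are taken from the operator's example manifests and may vary between versions, and the image tag, jar path, and bucket name are placeholders:

apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: my-flink-job
spec:
  image:
    name: flink:1.11.1          # placeholder image tag
  taskManager:
    replicas: 2
  job:
    jarFile: /opt/flink/usrlib/my-job.jar   # placeholder path to the application jar
    className: com.abc.def.MY_JOB
    args: ["--kafka-broker", "KAFKA_HOST:9092"]
    savepointsDir: s3://BUCKET_NAME/savepoints   # operator takes savepoints here for upgrades
    autoSavepointSeconds: 300                    # periodic automatic savepoints
  flinkProperties:
    state.backend: filesystem
    state.checkpoints.dir: s3://BUCKET_NAME/checkpoints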

Upvotes: 0

David Anderson

Reputation: 43524

  • You might be happier with Ververica Platform: Community Edition, which raises the level of abstraction to the point where you don't have to deal with the details at this level. It has an API that was designed with CI/CD in mind.
  • I'm not sure I understand your second point, but it's normal that your job will rewind and reprocess some data during recovery. Flink does not guarantee exactly once processing, but rather exactly once semantics: each event will affect the state being managed by Flink exactly once. This is done by rolling back to the offsets in the most recent checkpoint, and rolling back all of the other state to what it had been after consuming all of the data up to those offsets.
  • Having a state backend is necessary as a place to store your job's working state while the job is running. If you don't enable checkpointing, then the working state won't be checkpointed and cannot be recovered. However, as of Flink 1.11, you can enable checkpointing via the config file, using:
execution.checkpointing.interval: 60000
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION

Upvotes: 2
