Mehul

Reputation: 25

Checkpointing Fails in Flink Streaming Job (Table/SQL API)

My job flow works like this:

Src [Kafka] -> Lookup with MySQL -> Deduplication (using Top-N on proc time) -> Upsert Kafka/MySQL

The job itself runs fine and data flows correctly to Kafka and MySQL, but checkpointing fails. I have attached an image showing the failure.


P.S. For the time being I have disabled checkpointing, but when I enable it with the same properties, it fails.


Upvotes: 0

Views: 1624

Answers (1)

David Anderson

Reputation: 43454

The checkpoint is failing because it is timing out. The typical cause of checkpoint timeouts is backpressure that prevents the checkpoint barriers from making sufficiently rapid progress across the execution graph. Another possibility is inadequate bandwidth or quota for writing to the checkpoint storage.

Some ideas:

  • increase the timeout (the default timeout is 10 minutes; yours has been reduced to 2 minutes)
  • enable unaligned checkpoints (this should lessen the impact of backpressure on checkpoint times)
  • find the cause of the backpressure and alleviate it (the MySQL lookup is an obvious candidate)
  • examine the parallel subtasks for evidence of asymmetries in checkpoint sizes, alignment times, etc. indicating skew in the processing caused by hot keys, or misaligned watermarks, or other clues
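
As a rough sketch of the first two ideas, the settings below could go into the Flink configuration file (or be applied with equivalent `SET` statements in a SQL job). The values are illustrative assumptions, not tuned recommendations, and the option names should be checked against the documentation for your Flink version, since some keys have been renamed in recent releases.

```yaml
# Restore the checkpoint timeout to the 10-minute default
# (the job in the question had it reduced to 2 minutes).
execution.checkpointing.timeout: 10min

# Let checkpoint barriers overtake buffered records, so that
# backpressure has less effect on checkpoint duration.
# Assumption: this key is valid for your version; newer releases
# may prefer a differently named option.
execution.checkpointing.unaligned: true
```

Neither setting removes the underlying backpressure; they only make checkpoints more likely to complete while the cause is investigated.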

Upvotes: 1
