Reputation: 31
As far as I know, a checkpoint failure should be ignored and retried, potentially with a larger state. I ran into this situation: the following exception was thrown
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby. Visit https://s.apache.org/sbnn-error ...
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:453)
    at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.mkdirs(SafetyNetWrapperFileSystem.java:111)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.createBasePath(FsCheckpointStreamFactory.java:132)
The pipeline came back after a few restarts and checkpoint failures, once the HDFS issues were resolved.
I would not have worried about the restart, but it was evident that I lost my operator state. Either my Kafka consumer kept advancing its offsets between a restart and the next checkpoint failure (about a minute's worth), or the operator holding the partial aggregates lost its state. I have a 15-minute window of counts on a keyed operator.
I am using RocksDB and, of course, have checkpointing turned on.
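For context, here is a minimal sketch of that setup, assuming a Flink 1.3-style API and a hypothetical HDFS checkpoint path; the actual job (Kafka source, keyed 15-minute window) is elided:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds with exactly-once guarantees.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Leave some pause between attempts, and abort a checkpoint
        // (rather than the job) if it takes longer than two minutes.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
        env.getCheckpointConfig().setCheckpointTimeout(120_000);

        // RocksDB state backend writing checkpoints to HDFS
        // (the path below is hypothetical).
        env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // ... Kafka source, keyBy, 15-minute count window, and sink would go here ...

        env.execute("checkpointed-job");
    }
}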
The questions thus are:
Upvotes: 0
Views: 1354
Reputation: 452
That depends on your program's style. Suppose your program only proceeds after getting confirmation from the checkpoint function; then a checkpoint failure affects it.
If you write your program without relying on checkpoint confirmation, a checkpoint failure will not affect your pipeline.
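To illustrate the difference, here is a rough sketch (my own hypothetical example, not from the original post) of a sink that buffers records and only emits them once Flink's CheckpointListener confirms the checkpoint completed; a sink that writes immediately in invoke() would not be held up by checkpoint failures at all. Note this sketch ignores persisting the buffer itself, for brevity.

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

// Hypothetical sink that waits for checkpoint confirmation before emitting.
public class ConfirmAfterCheckpointSink extends RichSinkFunction<String>
        implements CheckpointListener {

    private final List<String> pending = new ArrayList<>();

    @Override
    public void invoke(String value) {
        // Buffer instead of writing straight to the external system.
        pending.add(value);
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Called only after the JobManager confirms a successful checkpoint,
        // so output stalls while checkpoints keep failing.
        for (String value : pending) {
            System.out.println("committing: " + value);
        }
        pending.clear();
    }
}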
Further Clarification
https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html
Upvotes: 1