Reputation: 60
My flink job has keyBy operator which takes date~clientId(date as yyyymmddhhMM, MM as minutes which changes after 5 mins) as key. This operator is followed by tumbling window of 5 mins. We have kafka input of 3 millions/min events on an average and around 20 millions/min events on peak time. Checkpointing duration and minimum pause between two checkpoiting is 3 mins.
Now here are my doubts :
1) Does the state created by keyBy is persisted forever or it's evicted after 5 minutes.
2) What changes are required in case i change this window to 30 minutes.
3) How the checkpointing time is affected by the window size.
4) What will be effect in a scenerio where number of distinct client in any 5 minutes goes 5-10 times. Will this create data skew. As 1-2 sub-tasks in my job always takes around 1-2 minutes as compare to other 800 sub-taks which completes in 10-15 seconds.
5) I am getting one exception once in every 5-6 hours which restarts the flink job. TimerException{java.nio.channels.ClosedByInterruptException} at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask. What maybe the probable reason.
Upvotes: 0
Views: 492
Reputation: 43717
A few points:
keyBy is not an operator, and has no state. keyBy is simply a declaration of how the stream is to be repartitioned. The tumbling window that follows the keyBy does have state, which is purged once the window is complete. You can see how much state each subtask has if you look at the breakdown in the checkpoint stats part of the web UI.
Here's an example:
What will be effect in a scenario where number of distinct client in any 5 minutes goes 5-10 times. Will this create data skew? As 1-2 sub-tasks in my job always takes around 1-2 minutes as compared to the other 800 sub-tasks which complete in 10-15 seconds.
Perhaps you have one or a few clients with many more events than the rest?
It would be interesting to understand why you are doing event-time-based keying followed by processing time windows, rather than using event time windows. (I assume you are using processing time windows, correct me if I'm wrong.)
Do you have any idea how many different timeframes are active at once? E.g., the window for 12:00-12:05 will receive many events with timestamps in the range of 12:00-12:05, plus some events for 11:55-12:00 that didn't arrive by 12:00. And possibly events for earlier timeframes, if that much delay is possible. It's hard to think about key skew without understanding what the active keyspace looks like.
Upvotes: 1