Spark streaming with Checkpoint

Question

I am a beginner to spark streaming. So have a basic doubt regarding checkpoints. My use case is to calculate the no of unique users by day. I am using reduce by key and window for this. Where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to mongodb. Currently I am replace the existing record each time. But I see the memory is slowly increasing over time and kills the process after 1 and 1/2 hours(in aws small instance). The DB write after the restart clears all the old data. So I understand checkpoint is the solution for this. But my doubt is

What should my check point duration be..? As per documentation it says 5-10 times of slide duration. But I need the data of entire day. So it is ok to keep 24 hrs.

Where ideally should the checkpoint be..? Initially when I receive the stream or just before the window operation or after the data reduction has taken place.

Appreciate your help.
Thank you

Arnon Rotem-Gal-Oz · Accepted Answer

In streaming scenarios holding 24 hours of data is usually too much. To solve that you use a probabilistic methods instead of exact measures for streaming and perform a later batch computation to get the exact numbers (if needed).

In your case to get a distinct count you can use an algorithm called HyperLogLog. You can see an example of using Twitter's implementation of HyperLogLog (part of a library called AlgeBird) from spark streaming here

Spark streaming with Checkpoint

Answers (1)

Related Questions