user2611300

Reputation: 133

Spark Streaming with Checkpoint

I am a beginner with Spark Streaming, so I have a basic question about checkpoints. My use case is to calculate the number of unique users per day. I am using reduceByKeyAndWindow for this, with a window duration of 24 hours and a slide duration of 5 minutes. I write the processed records to MongoDB, currently replacing the existing record each time. But I see memory slowly increasing over time, and the process gets killed after about an hour and a half (on an AWS small instance). The DB write after the restart then clears all the old data, so I understand checkpointing is the solution for this. My doubts are:
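To make this concrete, here is a stripped-down sketch of what I am doing (the names `userIds` and the checkpoint path are placeholders, not my real code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("UniqueUsersByDay")
val ssc = new StreamingContext(conf, Minutes(5))
ssc.checkpoint("hdfs:///checkpoints/unique-users") // required for stateful window ops

val userIds: DStream[String] = ??? // the input stream of user ids

// One entry per user per window: 24-hour window, 5-minute slide.
val userCounts = userIds
  .map(id => (id, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(24 * 60), Minutes(5))

// The docs suggest checkpointing at 5-10x the slide duration
// (e.g. 25 minutes here) rather than the full window length.
userCounts.checkpoint(Minutes(25))

userCounts.foreachRDD { rdd =>
  // write/replace the day's records in MongoDB here
}
```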

  • What should my checkpoint duration be? The documentation suggests 5-10 times the slide duration, but I need the data for the entire day. Is it OK to set it to 24 hours?
  • Where, ideally, should the checkpoint go? On the stream as I initially receive it, just before the window operation, or after the reduction has taken place?

Appreciate your help.
Thank you

    Upvotes: 0

    Views: 2693

    Answers (1)

Arnon Rotem-Gal-Oz

    Reputation: 25909

In streaming scenarios, holding 24 hours of data is usually too much. To solve that, you use probabilistic methods instead of exact measures for the streaming part, and perform a later batch computation to get the exact numbers (if needed).

In your case, to get a distinct count you can use an algorithm called HyperLogLog. You can see an example of using Twitter's implementation of HyperLogLog (part of a library called Algebird) from Spark Streaming here
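A hedged sketch of what that could look like, loosely following the TwitterAlgebirdHLL example shipped with Spark (the stream `userIds` and the bit size are assumptions):

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}
import org.apache.spark.streaming.dstream.DStream

val BITS = 12 // 2^12 registers, roughly 1.6% standard error
val hllMonoid = new HyperLogLogMonoid(BITS)

val userIds: DStream[String] = ??? // the same input stream as in the question

// Turn each id into a small HLL sketch and merge the sketches per batch.
val hllPerBatch = userIds
  .map(id => hllMonoid.create(id.getBytes("UTF-8")))
  .reduce(hllMonoid.plus(_, _))

// Merge batches into a running sketch; a sketch stays a few KB regardless
// of cardinality, so nothing like 24 hours of raw ids is retained.
// Reset globalHll at each day boundary to get a per-day count.
var globalHll: HLL = hllMonoid.zero
hllPerBatch.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    globalHll = hllMonoid.plus(globalHll, rdd.first())
    println(s"Approximate distinct users: ${globalHll.estimatedSize.toLong}")
  }
}
```

Since the merged sketch is tiny and mergeable, it also sidesteps the memory growth you are seeing: there is no 24-hour window of raw records to checkpoint at all.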

    Upvotes: 5
