Reputation: 473
We have a Spark job running that consumes data from a Kafka stream, does some analytics, and stores the results.
Since records are processed as they arrive from Kafka, aggregates such as
the count for a whole day, the count for an hour, or the average over the whole day
are not possible with this approach. Is there a way we should follow to accomplish such a requirement?
Appreciate any help.
Thanks and Regards
Raaghu.K
Upvotes: 1
Views: 62
Reputation: 957
Your streaming job is not supposed to calculate the daily count/average on its own.
Approach 1: Store the data consumed from Kafka in persistent storage such as a DB, HBase, or HDFS, and then run a daily batch job that calculates all the statistics you need, like the daily count or average.
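For example, here is a minimal sketch of this approach using Structured Streaming and Parquet on HDFS. The broker address, topic name, HDFS paths, and the assumption that the Kafka value is a plain numeric string are all placeholders for your actual setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyStats {

  // Job 1 (always on): land the raw Kafka records on HDFS, partitioned by date.
  def landToHdfs(spark: SparkSession): Unit = {
    import spark.implicits._
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                    // placeholder topic
      .load()
      .select($"value".cast("string").as("measure"),
              $"timestamp".as("ts"),
              to_date($"timestamp").as("dt"))
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///events")                 // placeholder path
      .option("checkpointLocation", "hdfs:///chk/events")
      .partitionBy("dt")
      .start()
      .awaitTermination()
  }

  // Job 2 (scheduled once per day): aggregate one day's partition.
  def dailyBatch(spark: SparkSession, date: String): Unit = {
    import spark.implicits._
    val day = spark.read.parquet(s"hdfs:///events/dt=$date")
      .withColumn("measure", $"measure".cast("double"))

    // Count and average for the whole day.
    day.agg(count("*").as("dailyCount"), avg($"measure").as("dailyAvg")).show()

    // Count per hour within the day.
    day.groupBy(hour($"ts").as("hr"))
      .agg(count("*").as("hourlyCount"))
      .orderBy("hr")
      .show(24)
  }
}
```

The landing job runs continuously, while the batch job is scheduled once a day (e.g. via cron or Oozie) and can compute any other statistic over the same data.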
Approach 2: To get that information from the streaming job itself, you can use accumulators that hold the running record count and sum, and compute the average from those.
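Here is a sketch of the accumulator idea with the DStream API; the socket source is just a stand-in for your Kafka stream, and the payload is again assumed to be a numeric string:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningAverage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("RunningAverage"), Seconds(10))
    val sc = ssc.sparkContext

    // Accumulators live on the driver and persist across micro-batches,
    // so they can hold the running count and sum for the whole day.
    val recordCount = sc.longAccumulator("recordCount")
    val recordSum = sc.doubleAccumulator("recordSum")

    // Stand-in source; replace with your Kafka DStream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.foreach { line =>
        recordCount.add(1L)
        recordSum.add(line.toDouble)
      }
      if (recordCount.value > 0)
        println(s"records so far: ${recordCount.value}, running avg: ${recordSum.value / recordCount.value}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Two caveats: you would need to reset the accumulators at each day boundary, and their values live on the driver, so they are lost if the driver restarts.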
Approach 3: Use a streaming window, though holding a whole day's worth of data in a window doesn't make much sense. If you need a 5- or 10-minute average, this is the one to use.
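For the windowed variant, here is a sketch of a 10-minute average that slides forward every minute, again with a socket source standing in for Kafka:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedAverage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("WindowedAverage"), Seconds(10))

    // Stand-in source; replace with your Kafka DStream of numeric strings.
    val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

    // Pair every value with a count of 1, then sum both components over a
    // 10-minute window that slides forward every minute.
    values.map(v => (v, 1L))
      .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Minutes(10), Minutes(1))
      .foreachRDD { rdd =>
        rdd.collect().foreach { case (sum, n) => println(s"10-min avg: ${sum / n}") }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that both the window and slide durations must be multiples of the batch interval.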
I think the first approach is preferable, as it gives you the most flexibility to calculate whatever analytics you want.
Upvotes: 2