Reputation: 473
We have a Spark job running that consumes data from a Kafka stream, does some analytics, and stores the results.
Since records are processed as they arrive from Kafka, aggregates such as
the count for a whole day, the count for an hour, or the average over the whole day
are not possible with this approach. Is there a way we should follow to accomplish such a requirement?
Appreciate any help.
Thanks and Regards
Raaghu.K
Upvotes: 1
Views: 62
Reputation: 957
Your streaming job is not supposed to calculate the daily count/average on its own.
Approach 1: Store the data consumed from Kafka in persistent storage such as a DB, HBase, or HDFS, and then run a daily batch job that calculates all the statistics you need, like the daily count or average.
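For example, here is a minimal sketch of this approach using Structured Streaming and Parquet on HDFS. The broker address, topic name, HDFS paths, and the assumption that the Kafka value is a plain numeric string are all placeholders for your actual setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyStats {

  // Job 1 (always on): land the raw Kafka records on HDFS, partitioned by date.
  def landToHdfs(spark: SparkSession): Unit = {
    import spark.implicits._
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                    // placeholder topic
      .load()
      .select($"value".cast("string").as("measure"),
              $"timestamp".as("ts"),
              to_date($"timestamp").as("dt"))
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///events")                 // placeholder path
      .option("checkpointLocation", "hdfs:///chk/events")
      .partitionBy("dt")
      .start()
      .awaitTermination()
  }

  // Job 2 (scheduled once per day): aggregate one day's partition.
  def dailyBatch(spark: SparkSession, date: String): Unit = {
    import spark.implicits._
    val day = spark.read.parquet(s"hdfs:///events/dt=$date")
      .withColumn("measure", $"measure".cast("double"))

    // Count and average for the whole day.
    day.agg(count("*").as("dailyCount"), avg($"measure").as("dailyAvg")).show()

    // Count per hour within the day.
    day.groupBy(hour($"ts").as("hr"))
      .agg(count("*").as("hourlyCount"))
      .orderBy("hr")
      .show(24)
  }
}
```

The landing job runs continuously, while the batch job is scheduled once a day (e.g. via cron or Oozie) and can compute any other statistic over the same data.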
Approach 2: To get that information from the streaming job itself, you can use accumulators that hold the running record count and sum, and compute the average from those.
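Here is a sketch of the accumulator idea with the DStream API; the socket source is just a stand-in for your Kafka stream, and the payload is again assumed to be a numeric string:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningAverage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("RunningAverage"), Seconds(10))
    val sc = ssc.sparkContext

    // Accumulators live on the driver and persist across micro-batches,
    // so they can hold the running count and sum for the whole day.
    val recordCount = sc.longAccumulator("recordCount")
    val recordSum = sc.doubleAccumulator("recordSum")

    // Stand-in source; replace with your Kafka DStream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.foreach { line =>
        recordCount.add(1L)
        recordSum.add(line.toDouble)
      }
      if (recordCount.value > 0)
        println(s"records so far: ${recordCount.value}, running avg: ${recordSum.value / recordCount.value}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Two caveats: you would need to reset the accumulators at each day boundary, and their values live on the driver, so they are lost if the driver restarts.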
Approach 3: Use a streaming window, though holding a whole day's worth of data in a window doesn't make much sense. If you need a 5- or 10-minute average, this is the one to use.
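For the windowed variant, here is a sketch of a 10-minute average that slides forward every minute, again with a socket source standing in for Kafka:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedAverage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("WindowedAverage"), Seconds(10))

    // Stand-in source; replace with your Kafka DStream of numeric strings.
    val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

    // Pair every value with a count of 1, then sum both components over a
    // 10-minute window that slides forward every minute.
    values.map(v => (v, 1L))
      .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Minutes(10), Minutes(1))
      .foreachRDD { rdd =>
        rdd.collect().foreach { case (sum, n) => println(s"10-min avg: ${sum / n}") }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that both the window and slide durations must be multiples of the batch interval.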
I think the first approach is preferable, as it gives you the most flexibility to calculate whatever analytics you want.
Upvotes: 2