Reputation: 243
I have a streaming dataframe and I want to calculate some daily counters. So far, I have been using tumbling windows with a watermark, as follows:
.withWatermark("timestamp", "10 minutes") \
.groupBy(window("timestamp", "1 day")) \
.count()
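For reference, the query above assigns each event to a 1-day tumbling window and counts events per window. A minimal pure-Python sketch of that bucketing (no Spark; the event times are hypothetical) shows what the aggregation computes once the watermark closes a window:

```python
from datetime import datetime, timedelta
from collections import Counter

EPOCH = datetime(1970, 1, 1)

def tumbling_window_counts(timestamps, window=timedelta(days=1)):
    # Map each event time to the start of its tumbling window and count
    # events per window -- the same result groupBy(window(...)).count()
    # produces for a completed window.
    counts = Counter()
    for ts in timestamps:
        n = (ts - EPOCH) // window          # whole windows since the epoch
        counts[EPOCH + n * window] += 1
    return dict(counts)

events = [datetime(2023, 5, 1, 9), datetime(2023, 5, 1, 23),
          datetime(2023, 5, 2, 0, 30)]
# The first two events fall in the 2023-05-01 window, the third in 2023-05-02.
```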
My question is whether this is the best way (resource-wise) to do this daily aggregation, or whether I should instead perform a series of aggregations on smaller windows (say hourly, or even smaller) and then roll these smaller counters up into the daily count.
Moreover, if I try the second approach, i.e. the smaller windows, how can I do this?
I cannot perform both aggregations (the hourly and the daily) within the same Spark Structured Streaming application; I keep getting the following error:
Multiple streaming aggregations are not supported with streaming
DataFrames/Datasets.
Should I therefore use one Spark application to post the hourly aggregations to a Kafka topic, read that stream from another Spark application, and perform the daily sum-up there?
If yes, then how should I handle the "update" output mode in the producer? The second application would receive repeatedly updated values for the same window from the first application, so its sum-up would be wrong.
Moreover, adding any trigger will not help with the watermark either: any late events arriving will update a previously emitted counter, and I would run into the same problem again.
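To make the double-counting concern concrete: in "update" output mode the first application re-emits a window's counter every time late data revises it, so a downstream consumer that naively sums every record it sees will count revised windows more than once. A toy illustration with hypothetical records:

```python
# Each record is (window_start, count) as emitted in "update" output mode;
# the same window appears twice because late events revised its counter.
updates = [("2023-05-01", 5), ("2023-05-02", 3), ("2023-05-01", 7)]

# Naive sum over all records counts the 2023-05-01 window twice:
naive_total = sum(c for _, c in updates)   # 5 + 3 + 7 = 15 -- wrong

# Correct: keep only the latest value per window (upsert semantics),
# e.g. via a keyed state store or a log-compacted Kafka topic.
latest = {}
for w, c in updates:
    latest[w] = c
correct_total = sum(latest.values())       # 7 + 3 = 10
```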
Upvotes: 0
Views: 1325
Reputation: 4045
I think you should perform the aggregation on the shortest time span required, and then perform a secondary aggregation on those primary aggregates. Keeping a full day's worth of aggregation state could OOM your job, if not now then definitely in the future as traffic grows.
This would add some DevOps effort, but it would also let you monitor your application much closer to real time.
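The two-stage idea can be sketched without Spark: a primary hourly aggregation emits (hour, count) pairs, and a secondary rollup sums those hourly counters into daily totals. The data below is hypothetical; in practice the first stage would be the streaming query and the second a downstream or batch job:

```python
from collections import Counter

# Output of the primary aggregation: (hour_bucket, count) pairs,
# with hour buckets keyed as "YYYY-MM-DDTHH".
hourly = [("2023-05-01T09", 120), ("2023-05-01T10", 80), ("2023-05-02T00", 50)]

# Secondary aggregation: roll hourly counters up to the day
# by truncating the key to its date prefix.
daily = Counter()
for hour, count in hourly:
    daily[hour[:10]] += count
# daily == {"2023-05-01": 200, "2023-05-02": 50}
```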
Upvotes: 1