Reputation: 93
I'm pretty new to spark streaming and I need some basic clarification that I couldn't fully understand reading the documentation.
The use case is that I have a set of files containing dumping EVENTS, and each events has already inside a field TIMESTAMP.
At the moment I'm loading this file and extracting all the events in a JavaRDD and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is if it is possible to process these event using the EVENT TIMESTAMP as temporal reference instead of the actual time of the machine (sorry for the silly question).
In case it is possible, will I need simply spark streaming or I need to switch to Structured Streaming?
I found a similar question here: Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
Upvotes: 2
Views: 2714
Reputation: 1565
It is easy to do aggregations based on event-time with Spark SQL (in either batch or structured streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket you data into 1 minute intervals and give you the count for each bucket.
df.groupBy(window($"timestamp", "1 minute") as 'time)
.count()
Upvotes: 0
Reputation: 2345
TL;DR
yes you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, how much of the complexity you address depends on the cost of inaccuracy in the Streaming job output.
Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to setup a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
Upvotes: 3