Sokrates

Reputation: 93

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to Spark Streaming and I need some basic clarification on points I couldn't fully understand from the documentation.

The use case is that I have a set of files containing dumped EVENTS, and each event already contains a TIMESTAMP field.

At the moment I'm loading these files and extracting all the events into a JavaRDD, and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).

My question is whether it is possible to process these events using the EVENT TIMESTAMP as the temporal reference instead of the actual machine time (sorry for the silly question).

If it is possible, will plain Spark Streaming suffice, or will I need to switch to Structured Streaming?

I found a similar question here: Aggregate data based on timestamp in JavaDStream of spark streaming

Thanks in advance

Upvotes: 2

Views: 2714

Answers (2)

Michael Armbrust

Reputation: 1565

It is easy to do aggregations based on event time with Spark SQL (in either batch or Structured Streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket your data into 1 minute intervals and give you the count for each bucket.

df.groupBy(window($"timestamp", "1 minute") as 'time)
  .count()
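
If you want the streaming rather than the batch flavour of this, a minimal sketch could look like the following; the source format, path, schema and column name here are assumptions rather than anything fixed by the question.

import org.apache.spark.sql.functions.window

// Sketch only: assumes a SparkSession `spark`, a StructType `eventSchema`
// describing the dumped events, and a hypothetical input directory.
import spark.implicits._

val events = spark.readStream
  .schema(eventSchema)
  .json("/path/to/event/dumps")

events.groupBy(window($"timestamp", "1 minute") as "time")
  .count()
  .writeStream
  .outputMode("complete")   // keep the full table of per-window counts
  .format("console")
  .start()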

Upvotes: 0

ImDarrenG

Reputation: 2345

TL;DR

Yes, you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.

Detailed answer

Sorry, there's no simple answer to this one. Spark Streaming might be better for per-event processing, if you need to examine each event individually. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.

However, there is a whole bunch of complexity in your requirements; how much of that complexity you address depends on the cost of inaccuracy in the streaming job's output.

Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to set up a window in which to do your processing that reduces the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
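
To make that concrete, here is a rough Spark Streaming sketch of the idea, assuming a SparkContext `sc` and events dumped as text lines of the form "<timestamp>,<payload>" into a monitored directory (the path, line format and durations are all placeholders): the DStream is windowed wide enough to absorb the expected out-of-order spread, and each event is then bucketed by its own timestamp rather than by arrival time.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: 10 second batches, processed in non-overlapping 5 minute windows.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("/path/to/event/dumps")   // hypothetical path

lines
  .window(Seconds(300), Seconds(300))
  .map { line =>
    val Array(ts, _) = line.split(",", 2)
    (ts.toLong / 60000, 1L)   // bucket by the event's own timestamp, 1 minute buckets
  }
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()

Events falling near a window boundary can still end up split across windows, which is exactly the kind of inaccuracy you need to decide whether you can live with.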

You'll need to give these points some thought:

  • If a batch fails and is retried, how will that affect your counters?
  • If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter, how can you guarantee the update is done safely? (There is a sketch of one option after this list.)
  • Will you minimise the risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
  • Will the partitioning of events cause complexity in the order that they are processed?
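
On the late-arrival point specifically, Structured Streaming has a built-in answer in the form of watermarks. A hedged sketch, in which the column name, path, schema and the 10 minute bound are all assumptions:

import org.apache.spark.sql.functions.window

// Sketch only: assumes a SparkSession `spark` and a schema `eventSchema`
// with a TimestampType column named "timestamp".
import spark.implicits._

val events = spark.readStream
  .schema(eventSchema)
  .json("/path/to/event/dumps")   // hypothetical path

events
  .withWatermark("timestamp", "10 minutes")   // accept events up to 10 minutes late
  .groupBy(window($"timestamp", "1 minute"))
  .count()
  .writeStream
  .outputMode("update")   // re-emit counts that late events have changed
  .format("console")
  .start()

Anything arriving later than the watermark is dropped, which makes the ignore/re-process/update trade-off above explicit.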

My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.

I hope that helps in some way.

Upvotes: 3
