Rim

Reputation: 1875

Is windowing based on event time possible with Spark Streaming?

According to the Dataflow Model paper, *The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing*:

MillWheel and Spark Streaming are both sufficiently scalable, fault-tolerant, and low-latency to act as reasonable substrates, but lack high-level programming models that make calculating event-time sessions straightforward.

Is that still the case?

Upvotes: 0

Views: 88

Answers (1)

Ged

Reputation: 18108

No, it is not.

To quote from https://dzone.com/articles/spark-streaming-vs-structured-streaming (so as to save my lunch break!):

One big issue in the streaming world is how to process data according to event-time.

Event-time is the time when the event actually happened. The source is not guaranteed to provide data to the streaming engine in real time; there may be latency both in generating the data and in handing it over to the processing engine. Spark Streaming has no option to work on data using event-time: it only works with the timestamp at which the data is received by Spark. Based on that ingestion timestamp, Spark Streaming puts the data in a batch even if the event was generated earlier and belongs to an earlier batch, which can produce less accurate results and is effectively data loss.

On the other hand, Structured Streaming provides the functionality to process data on the basis of event-time whenever the timestamp of the event is included in the received data. This is a major feature introduced in Structured Streaming: it processes the data according to the time of data generation in the real world. With this, we can handle late-arriving data and get more accurate results.
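To make the difference concrete, here is a toy sketch in plain Python (not the Spark API; the event data and 5-minute window size are made up for illustration). It buckets the same three events once by arrival time, mimicking Spark Streaming's ingestion-time micro-batches, and once by event time, mimicking Structured Streaming's event-time windows:

```python
from datetime import datetime

# Hypothetical events as (event_time, arrival_time, id). Event "c" happened
# at 10:04 but only reached the engine at 10:06, after its window had passed.
events = [
    (datetime(2024, 1, 1, 10, 1), datetime(2024, 1, 1, 10, 1), "a"),
    (datetime(2024, 1, 1, 10, 6), datetime(2024, 1, 1, 10, 6), "b"),
    (datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 6), "c"),  # late
]

def window_start(ts, minutes=5):
    """Start of the tumbling window containing ts (truncate to a 5-min boundary)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

def bucket(events, key):
    """Group event ids by the window of the chosen timestamp
    (key=0 -> event time, key=1 -> arrival/ingestion time)."""
    out = {}
    for ev in events:
        out.setdefault(window_start(ev[key]), []).append(ev[2])
    return out

# Ingestion-time grouping (Spark Streaming's behavior): the late event "c"
# lands in the 10:05 window next to "b", outside its true window.
print(bucket(events, key=1))

# Event-time grouping (Structured Streaming's behavior): "c" is counted in
# its true 10:00-10:05 window together with "a".
print(bucket(events, key=0))
```

With ingestion-time bucketing the 10:00–10:05 window under-counts, which is the "data loss" the article describes; event-time bucketing recovers the correct result. (Structured Streaming additionally uses watermarks to bound how long it waits for such late events, which this toy sketch omits.)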

With its event-time handling of late data, Structured Streaming has a clear advantage over Spark Streaming.

Upvotes: 1
