sergey_o

Reputation: 309

Late data handling in Spark past watermark

Is there a way in Spark to handle data that arrives past the watermark?

Consider a use case where devices send messages that need to be processed with Kafka + Spark. About 99% of messages reach the Spark server within, let us say, 10 minutes, but occasionally a device may leave the connectivity zone for a day or a week, buffer its messages internally, and deliver them only once the connection is restored, up to a week late.

The watermark interval necessarily has to be fairly short, because (1) results in the mainline case have to be produced in a timely manner, and (2) buffering space inside Spark is limited, so Spark cannot hold a week's worth of messages for all devices in a week-long watermark window.

In a regular Spark streaming setup, messages that arrive past the watermark are discarded.

Is there a way to intercept those "very late" messages and route them to a handler or a separate stream -- only those "rejected" messages that do not fall within the watermark?

Upvotes: 0

Views: 516

Answers (1)

Ged

Reputation: 18023

No, there is not. Spark has no feed for dropped data. As far as I remember, Apache Flink can handle this kind of thing (late records can be routed to a side output there).
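For illustration only: since nothing is exposed for the dropped records, one option would be to do the split in your own code before the watermarked stage. Below is a minimal plain-Python sketch of that idea. It is not Spark API, and `split_by_watermark`, `ALLOWED_DELAY` and the sample records are hypothetical names made up for the example. It models the watermark as the largest event time seen in earlier batches minus the allowed delay, which is a simplification of what Spark actually does.

```python
from datetime import datetime, timedelta

# Hypothetical allowed lateness, mirroring the 10-minute figure in the question.
ALLOWED_DELAY = timedelta(minutes=10)


def split_by_watermark(batches, allowed_delay=ALLOWED_DELAY):
    """Route each (device_id, event_time) record to on_time or late.

    Simplified model (an assumption, not Spark's exact behaviour): the
    watermark is the largest event time seen in earlier batches minus
    allowed_delay, and it only advances between batches.
    """
    on_time, late = [], []
    max_event_time = None
    for batch in batches:
        # Watermark is fixed for the whole batch; undefined before any data.
        watermark = None if max_event_time is None else max_event_time - allowed_delay
        for record in batch:
            _, event_time = record
            if watermark is not None and event_time < watermark:
                late.append(record)      # would be dropped by a watermarked stage
            else:
                on_time.append(record)   # goes on to the normal pipeline
        # Advance the high-water mark after the batch has been processed.
        batch_max = max((t for _, t in batch), default=None)
        if batch_max is not None and (max_event_time is None or batch_max > max_event_time):
            max_event_time = batch_max
    return on_time, late


# Sample data: dev-3 was offline for a week and delivers an old message.
now = datetime(2024, 1, 8, 12, 0)
batches = [
    [("dev-1", now - timedelta(minutes=2)), ("dev-2", now - timedelta(minutes=1))],
    [("dev-1", now), ("dev-3", now - timedelta(days=7))],
]
on_time, late = split_by_watermark(batches)
```

Here the week-old `dev-3` message ends up in `late`, where a separate handler could pick it up, while the other three records go to `on_time`. In a real job the same comparison would have to be expressed with whatever the pipeline offers upstream of the watermark, and the cutoff you compute yourself will not necessarily match Spark's internal watermark exactly.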

Upvotes: 1
