Reputation: 178
I'm new to distributed stream processing (Spark). I've read some tutorials/examples which cover how backpressure results in the producer(s) slowing down in response to overloaded consumers. The classic example given is ingesting and analyzing tweets. When there is an unexpected spike in traffic such that the consumers are unable to handle the load, they apply backpressure and the producer responds by adjusting its rate lower.
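The backpressure idea described above can be sketched with nothing beyond the Python standard library: a bounded queue makes a fast producer block whenever the slow consumer falls behind, so the producer's effective rate adapts downward and no data is dropped. This is an illustrative toy, not Spark's actual mechanism.

```python
import queue
import threading

BUFFER_SIZE = 10   # small bounded buffer: this is what creates backpressure
NUM_ITEMS = 100

buf = queue.Queue(maxsize=BUFFER_SIZE)
consumed = []

def producer():
    # A fast producer: put() blocks whenever the buffer is full,
    # which is exactly the backpressure signal that slows it down.
    for i in range(NUM_ITEMS):
        buf.put(f"tweet-{i}")
    buf.put(None)  # sentinel: no more data

def consumer():
    # The consumer drains the buffer one item at a time.
    while True:
        item = buf.get()
        if item is None:
            break
        consumed.append(item)

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()

print(len(consumed))  # every item arrives; none are dropped
```

The key design point is that the buffer is bounded: an unbounded buffer would hide the overload instead of propagating it back to the producer.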
What I don't really see covered is what approaches are used in practice to handle the backlog of incoming real-time data that cannot be processed immediately because the pipeline's overall capacity is lower than the incoming rate.
I imagine the answer to this is business domain dependent. For some problems it might be fine to just drop that data, but in this question I would like to focus on a case where we don't want to lose any data.
Since I will be working in an AWS environment, my first thought would be to "buffer" the excess data in an SQS queue or a Kinesis stream. Is it as simple as this in practice, or is there a more standard streaming solution to this problem (perhaps as part of Spark itself)?
Upvotes: 1
Views: 1942
Reputation: 16215
"Is there a more standard streaming solution?" - Maybe. There are a lot of different ways to do this, not immediately clear if there is a "standard" yet. This is just an opinion though, and you're not likely to get a concrete answer for this part.
"Is it as simple as this in practice?" - SQS and Kinesis have different usage patterns:
For your use case where you have a "massive amount of incoming real-time data which cannot be immediately processed", I'd focus your efforts on Kinesis over SQS, as the Kinesis model also aligns better with other streaming mechanisms like Spark / Kafka.
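On the Spark side, the legacy DStream API does expose rate-control settings that complement an upstream Kinesis buffer. A minimal sketch of the relevant `spark-defaults.conf` entries (these property names apply to Spark Streaming/DStreams, not Structured Streaming, and the rate value is an illustrative placeholder):

```
# Let Spark adapt its ingestion rate to observed processing capacity
spark.streaming.backpressure.enabled=true
# Cap on records per second per receiver (illustrative value)
spark.streaming.receiver.maxRate=10000
```

With backpressure enabled, Spark throttles how fast it pulls from the stream, and the unconsumed records simply accumulate in Kinesis within its retention window instead of being lost.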
Upvotes: 3