Reputation: 178
I'm new to distributed stream processing (Spark). I've read some tutorials/examples which cover how backpressure results in the producer(s) slowing down in response to overloaded consumers. The classic example given is ingesting and analyzing tweets. When there is an unexpected spike in traffic such that the consumers are unable to handle the load, they apply backpressure and the producer responds by adjusting its rate lower.
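The backpressure idea described above can be sketched with nothing beyond the Python standard library: a bounded queue makes a fast producer block whenever the slow consumer falls behind, so the producer's effective rate adapts downward and no data is dropped. This is an illustrative toy, not Spark's actual mechanism.

```python
import queue
import threading

BUFFER_SIZE = 10   # small bounded buffer: this is what creates backpressure
NUM_ITEMS = 100

buf = queue.Queue(maxsize=BUFFER_SIZE)
consumed = []

def producer():
    # A fast producer: put() blocks whenever the buffer is full,
    # which is exactly the backpressure signal that slows it down.
    for i in range(NUM_ITEMS):
        buf.put(f"tweet-{i}")
    buf.put(None)  # sentinel: no more data

def consumer():
    # The consumer drains the buffer one item at a time.
    while True:
        item = buf.get()
        if item is None:
            break
        consumed.append(item)

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()

print(len(consumed))  # every item arrives; none are dropped
```

The key design point is that the buffer is bounded: an unbounded buffer would hide the overload instead of propagating it back to the producer.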
What I don't really see covered is what approaches are used in practice to handle the backlog of incoming real-time data that cannot be processed immediately because the pipeline's overall capacity is lower than the incoming rate.
I imagine the answer to this is business domain dependent. For some problems it might be fine to just drop that data, but in this question I would like to focus on a case where we don't want to lose any data.
Since I will be working in an AWS environment, my first thought would be to "buffer" the excess data in an SQS queue or a Kinesis stream. Is it as simple as this in practice, or is there a more standard streaming solution to this problem (perhaps as part of Spark itself)?
Upvotes: 1
Views: 1942
Reputation: 16215
"Is there a more standard streaming solution?" - Maybe. There are a lot of different ways to do this, not immediately clear if there is a "standard" yet. This is just an opinion though, and you're not likely to get a concrete answer for this part.
"Is it as simple as this in practice?" - SQS and Kinesis have different usage patterns:
For your use case where you have a "massive amount of incoming real-time data which cannot be immediately processed", I'd focus your efforts on Kinesis over SQS, as the Kinesis model also aligns better with other streaming mechanisms like Spark / Kafka.
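On the Spark side, the legacy DStream API does expose rate-control settings that complement an upstream Kinesis buffer. A minimal sketch of the relevant `spark-defaults.conf` entries (these property names apply to Spark Streaming/DStreams, not Structured Streaming, and the rate value is an illustrative placeholder):

```
# Let Spark adapt its ingestion rate to observed processing capacity
spark.streaming.backpressure.enabled=true
# Cap on records per second per receiver (illustrative value)
spark.streaming.receiver.maxRate=10000
```

With backpressure enabled, Spark throttles how fast it pulls from the stream, and the unconsumed records simply accumulate in Kinesis within its retention window instead of being lost.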
Upvotes: 3