MaatDeamon

Reputation: 9771

Order Guarantee with Spark Streaming

I am trying to consume change events from Kafka and propagate them downstream to another system. However, the order of the changes matters, so I wonder what the appropriate way is to do that with some Spark transformation in the middle.

The only thing I see is to lose the parallelism and put the DStream on a single partition. Maybe there is a way to do the operations in parallel, bring everything back into one partition, and then send it to the external system; or put it back into Kafka and use a Kafka sink for the matter.
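To make that concrete, here is roughly what I have in mind for the single-partition variant. This is a sketch only, using the spark-streaming-kafka-0-10 direct stream; the broker address, topic name, and the sendDownstream call for the external system are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object OrderedChangeSink {
  // Hypothetical stand-in for the write to the external system.
  def sendDownstream(payload: String): Unit = println(payload)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ordered-changes"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",            // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "change-propagator",
      "auto.offset.reset"  -> "earliest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("changes"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      rdd
        .map(r => ((r.partition, r.offset), r.value)) // extract plain fields: ConsumerRecord is not serializable
        .sortBy(_._1, numPartitions = 1)              // one sorted partition: order preserved, parallelism lost
        .foreachPartition(_.foreach { case (_, payload) => sendDownstream(payload) })
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```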

What approach can I try?

Upvotes: 1

Views: 819

Answers (1)

rakesh

Reputation: 2051

In a distributed environment, with some form of caching/buffering at almost every layer, messages generated by the same machine may reach the back end in a different order. Also, the definition of order is subjective: implementing a global definition of order will be restrictive (and may not even be correct) for the data as a whole.

So, Kafka is meant to keep data in the order it was put, but partitions are the catch: the ordering guarantee holds only within a single partition, and the number of partitions defines the level of parallelism per topic.
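To make the per-partition guarantee concrete, here is a minimal producer sketch (broker address, topic name, and key are assumed): records that share a key hash to the same partition and are therefore consumed in the order they were sent, while records with different keys may interleave.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KeyedOrderDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")          // assumed broker address
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)
  // keep retries from reordering in-flight batches within a partition
  props.put("max.in.flight.requests.per.connection", "1")

  val producer = new KafkaProducer[String, String](props)

  // same key => same partition => consumed in exactly this order
  Seq("created", "updated", "deleted").foreach { event =>
    producer.send(new ProducerRecord("changes", "account-42", event))
  }

  producer.close()
}
```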

Typically, at the level of abstraction where Kafka sits, it should not bother much about order. It should be optimised for maximum throughput, which is where partitioning comes in handy. Consider ordering just a side effect of supporting streaming.

Now, whatever logic ensures that data is put into Kafka in order makes more sense in your application (the Spark job) than in Kafka itself.
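For example, if each event carries an entity id and only per-entity order matters, your Spark job can keep its parallelism and still put data back into Kafka in order by keying each record with that id. A sketch, assuming a hypothetical Change event type, topic name, and broker address:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.rdd.RDD

// Hypothetical event shape: entityId is the unit whose order must be kept.
case class Change(entityId: String, payload: String)

object KafkaWriteBack {
  def write(changes: RDD[Change]): Unit =
    changes.foreachPartition { events =>
      // one producer per executor task, created on the worker side
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")      // assumed broker address
      props.put("key.serializer", classOf[StringSerializer].getName)
      props.put("value.serializer", classOf[StringSerializer].getName)
      props.put("max.in.flight.requests.per.connection", "1") // avoid reorder on retry

      val producer = new KafkaProducer[String, String](props)
      events.foreach { c =>
        // keying by entityId sends all changes for one entity to one
        // partition, so Kafka preserves their relative order
        producer.send(new ProducerRecord("ordered-changes", c.entityId, c.payload))
      }
      producer.close()
    }
}
```

Note that this only preserves the order in which each task emits its records; if one batch can contain several changes for the same entity spread across tasks, you would still need to group or sort by entity first.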

Upvotes: 0
