Guo

Reputation: 1803

Must Spark Streaming finish processing the previous batch of data before it can process the next batch?

I set the Spark Streaming batch interval to 5s. If a very large amount of data arrives in the current 5s window and Spark Streaming can't finish processing it within 5s, the next batch of data will already be arriving.

Will Spark Streaming process the next batch of data at the same time?

In other words, do batches execute in parallel?

Upvotes: 3

Views: 2655

Answers (2)

Mitchell Tracy

Reputation: 1551

Spark Streaming handles one batch at a time. Additionally, the individual data items within each batch are processed in their order within the batch. By default, if Spark doesn't have enough time to get through all of the data items in a batch before the next one arrives, those data items will be dropped.

However, if you use a more advanced connection to your stream, such as Kafka, Spark can handle a pending batch once it finishes the current one. This causes batches to build up in Kafka; this build-up is called "back pressure", and it too can grow to the point where Kafka must start dropping data as well.
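
The one-batch-at-a-time scheduling described above can be illustrated with a toy simulation (plain Python, not Spark): batches are generated at a fixed interval, but each one starts processing only after the previous one finishes, so when processing is slower than the interval, the scheduling delay grows with every batch.

```python
def simulate(interval, processing_times):
    """Toy model of sequential batch scheduling (not a Spark API).

    Batch i is generated at time i * interval (seconds); a batch
    starts processing only once the previous batch has finished.
    Returns each batch's scheduling delay in seconds.
    """
    delays = []
    free_at = 0.0  # time at which the single processing "slot" becomes free
    for i, cost in enumerate(processing_times):
        generated = i * interval
        start = max(generated, free_at)  # must wait for the previous batch
        delays.append(start - generated)
        free_at = start + cost
    return delays

# A new batch arrives every 5 s, but each takes 7 s to process:
# the backlog grows by 2 s per batch.
print(simulate(5, [7, 7, 7, 7]))  # [0.0, 2.0, 4.0, 6.0]
```

With `interval=5` and processing times of 3 s, every delay is zero, which is the healthy steady state you want a Spark Streaming job to stay in.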

If you are not using an advanced connection such as Kafka, and your data stream is "bursty", meaning that there are periods of high input rates, you may want to increase your batch interval to minimize data loss.
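
Besides increasing the batch interval, Spark exposes rate-limiting and backpressure settings that cap or adapt the ingestion rate. The property names below are from the Spark Streaming configuration; the values are only illustrative:

```
# spark-defaults.conf (illustrative values)

# Let Spark adapt the receiving rate to the current processing speed (Spark 1.5+)
spark.streaming.backpressure.enabled        true

# Hard cap on records per second for receiver-based streams
spark.streaming.receiver.maxRate            10000

# Hard cap per Kafka partition for the direct Kafka stream
spark.streaming.kafka.maxRatePerPartition   1000
```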

Upvotes: 6

z-star

Reputation: 690

Spark Streaming is a time-based pipeline: first come, first served. It will not process two adjacent batches together; it handles each batch in the best way it can, including distributing the work across the cluster. In the better case, it will handle a pending batch once it finishes the current one. This is called back pressure and works with certain receivers, such as Kafka. If not, it will simply lose this data.

Upvotes: 2
