GANdalf85

Reputation: 129

Skipping of batches in Spark Structured Streaming process

I have a Spark Structured Streaming job which consumes events coming from the Azure Event Hubs service. In some cases, some of the batches are not processed by the streaming job. When this happens, the following logging statement can be seen in the structured streaming log:

INFO FileStreamSink: Skipping already committed batch 25

The streaming job persists the incoming events into an Azure Data Lake, so I can check which events have actually been processed/persisted. When the skipping above happens, these events are missing!
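For context, here is a minimal sketch of such a job (Scala, assuming the azure-eventhubs-spark connector; the connection string and paths are placeholders, not my actual code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("eventhubs-to-adls").getOrCreate()

// Placeholder connection config for the Event Hubs source.
val ehConf = Map("eventhubs.connectionString" -> "<encrypted-connection-string>")

val events = spark.readStream
  .format("eventhubs")
  .options(ehConf)
  .load()

// The file sink tracks committed batch IDs under <output path>/_spark_metadata.
events.writeStream
  .format("parquet")
  .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/checkpoints/myJob")
  .start("abfss://<container>@<account>.dfs.core.windows.net/output/myJob")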

It is unclear to me why these batches are marked as already committed, because in the end it seems they were never processed!

Do you have an idea what might cause this behaviour?

Thanks!

Upvotes: 2

Views: 1215

Answers (2)

Gara Walid

Reputation: 455

We had the same issue: the Kafka broker had already deleted the data. To force the Spark application to start fresh (from the latest offset in Kafka), we deleted both the checkpoint and the _spark_metadata directories. You can find _spark_metadata in the same path where you write the stream.
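A minimal sketch of that cleanup using the Hadoop FileSystem API (the paths are illustrative, not from the original setup):

import org.apache.hadoop.fs.{FileSystem, Path}

// Runs inside an active Spark session; both deletions are recursive.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path("/checkpoints/myJob"), true)            // streaming checkpoint
fs.delete(new Path("/output/myJob/_spark_metadata"), true) // file sink commit log

Note that deleting the checkpoint discards all progress, so the query will reprocess from whatever offsets the source still retains.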

Upvotes: 3

GANdalf85

Reputation: 129

I was able to solve the issue. The problem was that I had two different streaming jobs with different checkpoint locations (which is correct), but they used the same base folder for their output. The output folder also holds metadata, so the two streams shared the information about which batches they had already committed. After switching to a different base output folder for each job, the issue was fixed.
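For illustration, a sketch of the fixed layout (query and path names are made up): each query keeps its own checkpoint location and its own output base folder, so each FileStreamSink maintains a separate _spark_metadata commit log.

// Query A
dfA.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/jobA")
  .start("/data/jobA")   // commit log at /data/jobA/_spark_metadata

// Query B
dfB.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/jobB")
  .start("/data/jobB")   // separate commit log at /data/jobB/_spark_metadata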

Upvotes: 2
