Reputation: 129
I have a Spark Structured Streaming job which consumes events coming from the Azure Event Hubs service. In some cases some batches are not processed by the streaming job. In those cases the following logging statement can be seen in the structured streaming log:
INFO FileStreamSink: Skipping already committed batch 25
The streaming job persists the incoming events into an Azure Data Lake, so I can check which events have actually been processed/persisted. When the skipping above happens, these events are missing!
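For reference, the job looks roughly like this (a simplified sketch; the connection string, paths, and names are placeholders, and spark is the active SparkSession):

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}

// Placeholder connection string for the Event Hubs namespace
val ehConf = EventHubsConf(ConnectionStringBuilder("Endpoint=sb://...;EntityPath=...").build)

val events = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

events.writeStream
  .format("parquet")
  .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/job1") // placeholder
  .option("path", "abfss://container@account.dfs.core.windows.net/output/events")                  // placeholder
  .start()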
It is unclear to me why these batches are marked as already committed, because in the end it seems they were not processed!
Do you have an idea what might cause this behaviour?
Thanks!
Upvotes: 2
Views: 1215
Reputation: 455
We had the same issue: the Kafka broker had already deleted the data. To force the Spark application to start over (from the latest offset in Kafka), we deleted both the checkpoint and _spark_metadata directories. You can find _spark_metadata in the same path where you write the stream.
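For example, assuming the query checkpoints to /checkpoints/my-stream and writes to /output/events (both placeholder paths), the two directories can be deleted from a Spark shell like this; the query will then reprocess according to its configured starting offsets:

import org.apache.hadoop.fs.{FileSystem, Path}

// spark is the active SparkSession
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path("/checkpoints/my-stream"), true)         // the query's checkpoint directory
fs.delete(new Path("/output/events/_spark_metadata"), true) // the file sink's metadata log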
Upvotes: 3
Reputation: 129
I was able to solve the issue. The problem was that I had two different streaming jobs which had different checkpoint locations (which is correct) but used the same base folder for their output. The output folder also stores metadata (the _spark_metadata directory), so the two streams shared the information about which batches they had already committed. After giving each stream its own base output folder, the issue was fixed. A sketch of the fix follows below.
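In other words (a minimal sketch with placeholder paths; stream1 and stream2 stand for the two jobs' streaming DataFrames), each job needs its own output base folder, not just its own checkpoint location:

// FileStreamSink records committed batch IDs in <path>/_spark_metadata,
// so two jobs writing to the same path skip each other's batch IDs.
stream1.writeStream
  .format("parquet")
  .option("checkpointLocation", "abfss://.../checkpoints/job1") // placeholder
  .option("path", "abfss://.../output/job1")                    // own base output folder
  .start()

stream2.writeStream
  .format("parquet")
  .option("checkpointLocation", "abfss://.../checkpoints/job2")
  .option("path", "abfss://.../output/job2")
  .start()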
Upvotes: 2