MaatDeamon

Reputation: 9771

Spark Structured Streaming Kafka Integration Offset management

The documentation says:

enable.auto.commit: Kafka source doesn’t commit any offset.

Hence my question is: in the event of a worker or partition crash/restart,

  1. if startingOffsets is set to latest, how do we not lose messages?
  2. if startingOffsets is set to earliest, how do we not reprocess all messages?

This seems to be quite important. Any indication of how to deal with this?
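For reference, the option in question is set on the Kafka source, e.g. (the broker address and topic name below are placeholders of mine):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest") // or "earliest"
  .load()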

Upvotes: 1

Views: 846

Answers (1)

Lalit

Reputation: 2014

I also ran into this issue.

You're right in your observations about the two options, i.e.

  • potential data loss if startingOffsets is set to latest
  • duplicate data if startingOffsets is set to earliest

However...

There is the option of checkpointing, which you can enable by adding the following to the write stream:

.writeStream
  .<something else>
  .option("checkpointLocation", "path/to/HDFS/dir")
  .<something else>

In the event of a failure, Spark goes through the contents of this checkpoint directory and recovers its state, including the last processed offsets, before accepting any new data. On restart, the query therefore resumes from the checkpointed offsets rather than from startingOffsets (which only applies to a fresh query with no checkpoint), so you get neither loss nor wholesale reprocessing.
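Putting it together, here is a minimal end-to-end sketch in Scala, assuming a Kafka source and a Parquet sink; the broker address, topic name, and directory paths are placeholders, not something from the original answer:

import org.apache.spark.sql.SparkSession

object KafkaCheckpointedQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaCheckpointedQuery")
      .getOrCreate()

    // startingOffsets only matters on the very first run of the query;
    // once a checkpoint exists, Spark resumes from the offsets recorded there.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "latest")
      .load()

    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "path/to/output/dir")
      // Offsets and query state are tracked here and replayed on restart.
      .option("checkpointLocation", "path/to/HDFS/dir")
      .start()

    query.awaitTermination()
  }
}

With a fault-tolerant sink like this, restarting the same query with the same checkpointLocation picks up exactly where it left off.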

I found this useful reference on the same topic.

Hope this helps!

Upvotes: 3
