MaatDeamon

Reputation: 9771

Spark Structured Streaming Kafka Integration Offset management

The documentation says:

enable.auto.commit: Kafka source doesn’t commit any offset.

Hence my question is: in the event of a worker or partition crash/restart,

  1. if startingOffsets is set to latest, how do we not lose messages?
  2. if startingOffsets is set to earliest, how do we not reprocess all messages?

This seems to be quite important. Any indication of how to deal with this?
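For reference, the option in question is set on the Kafka source, e.g. (the broker address and topic name below are placeholders of mine):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest") // or "earliest"
  .load()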

Upvotes: 1

Views: 846

Answers (1)

Lalit

Reputation: 2014

I also ran into this issue.

You're right in your observations about the two options, i.e.

  • potential data loss if startingOffsets is set to latest
  • duplicate data if startingOffsets is set to earliest

However...

There is the option of checkpointing, which you can enable by adding the following to the write stream:

.writeStream
  .<something else>
  .option("checkpointLocation", "path/to/HDFS/dir")
  .<something else>

In the event of a failure, Spark goes through the contents of this checkpoint directory and recovers its state, including the last processed offsets, before accepting any new data. On restart, the query therefore resumes from the checkpointed offsets rather than from startingOffsets (which only applies to a fresh query with no checkpoint), so you get neither loss nor wholesale reprocessing.
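Putting it together, here is a minimal end-to-end sketch in Scala, assuming a Kafka source and a Parquet sink; the broker address, topic name, and directory paths are placeholders, not something from the original answer:

import org.apache.spark.sql.SparkSession

object KafkaCheckpointedQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaCheckpointedQuery")
      .getOrCreate()

    // startingOffsets only matters on the very first run of the query;
    // once a checkpoint exists, Spark resumes from the offsets recorded there.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "latest")
      .load()

    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "path/to/output/dir")
      // Offsets and query state are tracked here and replayed on restart.
      .option("checkpointLocation", "path/to/HDFS/dir")
      .start()

    query.awaitTermination()
  }
}

With a fault-tolerant sink like this, restarting the same query with the same checkpointLocation picks up exactly where it left off.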

I found this useful reference on the same topic.

Hope this helps!

Upvotes: 3
