Reputation: 13585
This seems like a very straightforward implementation, but it looks like there are some issues.
The job reads UI event data from a Kafka topic, does some aggregation, and writes the results to an Aerospike database.
Under high traffic I start seeing an issue where the job is running fine but no new data is being inserted. Looking at the logs I see these WARNING messages:
Current batch is falling behind. The trigger interval is 30000 milliseconds, but spent 43491 milliseconds
A few times the job resumes writing data, but the counts are low, which indicates some data loss.
Here is the code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

Dataset<Row> stream = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaBootstrapServersString)
        .option("subscribe", newTopic)
        .option("startingOffsets", "latest")
        .option("enable.auto.commit", false)
        .option("failOnDataLoss", false)
        .load();

StreamingQuery query = stream
        .writeStream()
        .option("startingOffsets", "earliest")
        .outputMode(OutputMode.Append())
        .foreach(sink)
        .trigger(Trigger.ProcessingTime(triggerInterval))
        .queryName(queryName)
        .start();
Upvotes: 5
Views: 2645
Reputation: 1708
You may need to set maxOffsetsPerTrigger to cap the total number of input records per batch. Otherwise, once your application lags, each batch pulls in more records, which slows the next batch down and in turn causes even more lag in the following batches.
Please refer to the link below for more details on the Kafka configuration options.
https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html
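For example, you could cap the batch size on the Kafka reader like this (a minimal sketch based on the code in the question; the value 10000 is a placeholder that you would need to tune to your throughput and trigger interval):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same reader as in the question, with a rate limit added.
Dataset<Row> stream = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaBootstrapServersString)
        .option("subscribe", newTopic)
        .option("startingOffsets", "latest")
        .option("failOnDataLoss", false)
        // Cap the number of Kafka records read per micro-batch so each
        // batch can finish within the 30-second trigger interval.
        .option("maxOffsetsPerTrigger", 10000)
        .load();

With this cap in place, a single slow batch no longer snowballs: the next batch reads at most the configured number of offsets instead of the entire accumulated lag.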
Upvotes: 2