BdEngineer

Reputation: 3199

What is the meaning of "OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions"?

I use Apache Spark 2.4.1 and kafka data source.

Dataset<Row> df = sparkSession
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", SERVERS)
  .option("subscribe", TOPIC) 
  .option("startingOffsets", "latest")
  .option("auto.offset.reset", "earliest") 
  .load();

I have two sinks: the raw data is stored in an HDFS location and, after a few transformations, the final data is stored in a Cassandra table. The checkpointLocation is an HDFS directory.

When starting the streaming query it gives the warning below:

2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - Some data may be lost. Recovering from the earliest offset: 470021
2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - The current available offset range is AvailableOffsetRange(470021,470021). Offset 62687 is out of range, and records in [62687, 62727) will be skipped (GroupId: spark-kafka-source-1fba9e33-165f-42b4-a220-6697072f7172-1781964857-executor, TopicPartition: INBOUND-19). Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "true".

I also tried setting both auto.offset.reset and startingOffsets to latest.

2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,_INBOUND-19)

What is this telling me? How can I get rid of the warning (if possible)?

Upvotes: 1

Views: 4453

Answers (1)

Jacek Laskowski

Reputation: 74739

Some data may be lost. Recovering from the earliest offset: 470021

This warning appears when the streaming query starts from checkpointed offsets that are no longer available in the topic. In your log, the query asks for offset 62687, while the earliest offset Kafka still holds for INBOUND-19 is 470021.

In other words, the checkpointLocation holds state that is no longer current, hence the warning (not an error).

That means your query is too slow relative to the topic's cleanup.policy (retention or compaction): records are removed from Kafka before the query gets to read them.
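To make the numbers in the warning concrete, here is a minimal sketch of the range arithmetic it describes. This is not Spark's actual implementation; the class OffsetRangeCheck, the nested Range type and the method skippedRange are hypothetical names used only for illustration. The inputs are taken from the log in the question: the query asked for offsets [62687, 62727) and the earliest offset still available was 470021.

```java
import java.util.Optional;

// Illustration only: hypothetical names, not part of Spark or Kafka.
public class OffsetRangeCheck {

    /** Half-open offset range [start, end). */
    public static final class Range {
        public final long start;
        public final long end;

        public Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Returns the part of the requested range [requestedStart, requestedEnd)
     * that lies below the earliest offset still available in the partition,
     * or empty if nothing was aged out.
     */
    public static Optional<Range> skippedRange(long requestedStart,
                                               long requestedEnd,
                                               long earliestAvailable) {
        if (requestedStart >= earliestAvailable) {
            return Optional.empty(); // the requested offsets still exist
        }
        // Everything below the earliest available offset is gone.
        long skippedEnd = Math.min(requestedEnd, earliestAvailable);
        return Optional.of(new Range(requestedStart, skippedEnd));
    }

    public static void main(String[] args) {
        // Values from the warning: offset 62687, batch end 62727,
        // AvailableOffsetRange(470021,470021).
        Optional<Range> skipped = skippedRange(62687L, 62727L, 470021L);
        System.out.println(skipped.map(Range::toString).orElse("nothing skipped"));
        // prints "[62687, 62727)"
    }
}
```

With the values from the log, the whole requested range lies below the earliest available offset, which is why the warning reports every record in [62687, 62727) as skipped.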

Upvotes: 4
