Reputation: 3199
I use Apache Spark 2.4.1 and kafka data source.
Dataset<Row> df = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "latest")
.option("auto.offset.reset", "earliest")
.load();
I have two sinks: raw data stored in hdfs location , after few transformations final data is stored in Cassandra table. The checkpointLocation
is an HDFS directory.
When starting the streaming query it gives the warning below:
2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - Some data may be lost. Recovering from the earliest offset: 470021 2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - The current available offset range is AvailableOffsetRange(470021,470021). Offset 62687 is out of range, and records in [62687, 62727) will be skipped (GroupId: spark-kafka-source-1fba9e33-165f-42b4-a220-6697072f7172-1781964857-executor, TopicPartition: INBOUND-19). Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "true".
I also used auto.offset.reset
as latest
and startingOffsets
as latest
.
2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,_INBOUND-19)
What is this telling me? How to get rid of the warning (if possible)?
Upvotes: 1
Views: 4453
Reputation: 74739
Some data may be lost. Recovering from the earliest offset: 470021
The above warning happens when your streaming query started with checkpointed offsets that are past what's currently available in topics.
In other words, the streaming query uses checkpointLocation
with state that is no longer current and hence the warning (not error).
That means that your query is too slow compared to cleanup.policy (retention or compaction).
Upvotes: 4