Reputation: 5890
I'm using Spark Streaming to read messages from Kafka, and it works fine. But I have a requirement to re-read the messages. I thought I could just change the Spark app's consumer group id and restart the streaming app, and it would re-read the Kafka messages from the beginning. Instead, Spark did not receive any messages, and I'm confused. According to the Kafka documentation, if you change the consumer group id you should get the messages from the beginning, because Kafka treats you as a new consumer. Thanks in advance!
Upvotes: 1
Views: 861
Reputation: 611
Sounds like you are using Spark Streaming's receiver-based API for Kafka. For that API, auto.offset.reset only applies if there aren't offsets in ZooKeeper for the group, as you noticed.
If you want to be able to specify the exact offsets, see the version of the createDirectStream call that takes fromOffsets as an argument.
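A minimal sketch of that approach in Python, assuming the old `pyspark.streaming.kafka` module (the Kafka 0.8 integration shipped with Spark 1.x/2.x) is available. The helper names `start_offsets` and `create_replay_stream`, and the choice of offset 0 for every partition, are illustrative assumptions rather than anything from the question. Offset 0 may no longer exist if retention has already deleted old segments.

```python
def start_offsets(topic, num_partitions, offset=0):
    """Return {(topic, partition): offset} for every partition of the topic."""
    return {(topic, p): offset for p in range(num_partitions)}


def create_replay_stream(ssc, brokers, topic, num_partitions):
    """Build a direct stream that starts reading at explicitly chosen offsets."""
    # Imported lazily so start_offsets() can be used without Spark installed.
    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

    # createDirectStream expects a dict keyed by TopicAndPartition objects.
    from_offsets = {
        TopicAndPartition(t, p): off
        for (t, p), off in start_offsets(topic, num_partitions).items()
    }
    kafka_params = {"metadata.broker.list": brokers}
    return KafkaUtils.createDirectStream(
        ssc, [topic], kafka_params, fromOffsets=from_offsets
    )
```

With the direct API the starting point is decided by the application, so the consumer group id and the offsets stored in ZooKeeper do not determine where reading begins.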
Upvotes: 2
Reputation: 16392
Kafka consumers have a property called auto.offset.reset (See Kafka Doc). This tells the consumer what to do when it starts consuming but hasn't committed an offset yet. This is your case. The topic has messages, but there's no start offset stored because you haven't read anything under that new group id yet. In this situation, the auto.offset.reset property is used. If the value is "largest" (and this is the default), then the start position is set to the largest offset (the end of the partition), so only messages that arrive afterwards are read, and you get the behavior you're seeing. If the value is "smallest", then the start position is set to the earliest available offset and the consumer reads the entire partition. This is what you want.
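The rule described above can be written down as a small toy model. This is only an illustration of the decision, not Kafka's actual code: the function name `resolve_start_offset` and its parameters are made up for the example, and it ignores the related case of a committed offset that is out of range.

```python
def resolve_start_offset(committed, earliest, latest, auto_offset_reset="largest"):
    """Toy model: where does a consumer group start reading one partition?

    committed is None when the group has never committed an offset.
    """
    if committed is not None:
        # An existing group resumes where it left off; the reset policy is ignored.
        return committed
    if auto_offset_reset == "smallest":
        return earliest  # read the whole partition
    if auto_offset_reset == "largest":
        return latest  # skip everything already in the partition
    raise ValueError("unknown auto.offset.reset value: %r" % auto_offset_reset)


# A brand-new group id with the default policy starts at the end of the log.
print(resolve_start_offset(None, earliest=0, latest=1000))
# The same group with "smallest" starts at the beginning.
print(resolve_start_offset(None, earliest=0, latest=1000, auto_offset_reset="smallest"))
```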
So I'm not exactly sure how you'd set that Kafka property in your Spark app, but you definitely want that property set to "smallest" if you want the new group id to result in a read of the entire topic.
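One possible way to pass that property from a Spark app, sketched in Python under the assumption that the receiver-based `KafkaUtils.createStream` from the old `pyspark.streaming.kafka` module (Spark 1.x/2.x) is in use. The helper names `receiver_kafka_params` and `create_stream_from_beginning` are hypothetical, and the single consumer thread per topic is an arbitrary choice.

```python
def receiver_kafka_params(reset="smallest"):
    """Extra Kafka consumer properties for the receiver-based stream."""
    if reset not in ("smallest", "largest"):
        raise ValueError("auto.offset.reset must be 'smallest' or 'largest'")
    return {"auto.offset.reset": reset}


def create_stream_from_beginning(ssc, zk_quorum, group_id, topic):
    """Create a receiver-based stream whose new group starts at the earliest offset."""
    # Imported lazily so receiver_kafka_params() can be used without Spark installed.
    from pyspark.streaming.kafka import KafkaUtils

    # The topics dict maps each topic name to the number of consumer threads.
    return KafkaUtils.createStream(
        ssc, zk_quorum, group_id, {topic: 1},
        kafkaParams=receiver_kafka_params("smallest"),
    )
```

The property only matters for a group id that has no stored offsets, so it needs to be combined with a fresh group id to trigger a full re-read.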
Upvotes: 2