spark can not get message from Kafka with new groupId

Question

I'm using spark streaming to read message from Kafka, it works fine. But I had one requirement which needs to re-read the messages. I was thinking I may just need to change the spark's customer groupId and restart spark streaming app, it should reread the kafka message from beginning. But the result was that Spark could not get any messages, I'm confused. By Kafka document if you change the customer groupId then it should get message from beginning, because kafka treat you as a new customer. Thanks in advance!

Chris Gerken · Accepted Answer

Kafka consumers have a property called auto.offset.reset (See Kafka Doc). This tells the consumer what to do when it starts consuming but it hasn't committed an offset, yet. This is your case. The topic has messages, but there's no start offset stored because you haven't read anything under that new group id, yet. In this situation, the auto.offset.reset property is used. If the value is "largest", and this is the default), then the start position is set to the largest offset (the last) and you get the behavior you're seeing. If the value is "smallest" then the offset is set to the beginning offset and the consumer would read the entire partition. This is what you want.

So I'm not exactly sure how you'd set that Kafka property in your Spark app, but you definitely want that property set to "smallest" if you want the new group id to result in a read of the entire topic.

spark can not get message from Kafka with new groupId

Answers (2)

Related Questions