jb62

Reputation: 2474

What does Kafka offset actually count?

I have a Kafka topic that I want to replay from the beginning. I ran this command to get the offset:

[zk: localhost:2181(CONNECTED) 56] get /kafka/consumers/foo43/offsets/statdxSolrXmlDocs/0

and I get this:

30024
cZxid = 0x36000000e8
ctime = Fri Feb 12 04:46:27 MST 2016
mZxid = 0x36000001ac
mtime = Fri Feb 12 05:14:49 MST 2016
pZxid = 0x36000000e8
cversion = 0
dataVersion = 6
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0

30024 bears absolutely no resemblance to the number of messages I pumped into the topic about 10 minutes earlier (that number was 3500).

So my assumption that this value would reflect the actual number of messages seems to be wrong, which is fine, but I'd like to know exactly what IS being counted, because when I issue this command

set /kafka/consumers/foo43/offsets/statdxSolrXmlDocs/0 1

I see the number change from 30024 to 1, and if I re-run my consumer microservice, I do get all 3500 messages again.

So what, exactly, is Kafka counting? It can't be counting just the messages I sent, and yet if it isn't counting messages per se, a reset to 1 shouldn't work that well (I'm thinking).

Just in case it matters: I'm using a simple consumer in my microservice, and I do see this number change whenever the simple consumer finishes its run (having consumed all available messages).

I should add that there were no messages when I ran the --from-beginning command line statement just before pumping in those 3500 messages.

Upvotes: 0

Views: 3994

Answers (1)

jb62

Reputation: 2474

Based on my black box testing, Kafka clearly counts messages, NOT bytes or something else esoteric.

In my situation, I had re-run the same messages into the topic over and over again. From what I can see, if you "clear" a topic via the command line, the count of messages does not get reset to zero.

What I did find is that if I take 3500 (the number of messages) away from 30024, I get exactly the same number I got by running this command (which gets you the lowest offset):

bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic=statdxSolrXmlDocs --broker-list=192.168.56.3:9092  --time=-2

which returned 26524.
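The arithmetic above can be checked with a minimal sketch. It assumes, as the observations here suggest, that each message occupies exactly one offset; the function name `messages_between` is made up for illustration and is not part of any Kafka API.

```python
# Numbers reported in this question and answer.
committed_offset = 30024  # value stored in ZooKeeper for the consumer group
earliest_offset = 26524   # returned by GetOffsetShell with --time=-2
messages_sent = 3500      # messages produced in the most recent run


def messages_between(start_offset: int, end_offset: int) -> int:
    """Count messages in [start_offset, end_offset), assuming one offset per message."""
    return end_offset - start_offset


# Under that assumption, the gap between the two offsets equals the message count.
print(messages_between(earliest_offset, committed_offset) == messages_sent)
```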

If I pass 26524 in with this command:

set /kafka/consumers/foo43/offsets/statdxSolrXmlDocs/0 26524

Then I get all 3500 messages again when I run my microservice.

So, in a nutshell, Kafka does count messages, including messages since deleted, but you can apparently trust the number it returns for the lowest offset.

To test this, I wiped everything from Zookeeper, re-installed Kafka, and re-sent all the messages into Kafka. In that case my lowest offset was 0 and my highest was the number of messages I'd sent in.
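To make the observed behavior concrete, here is a toy model of a single partition. It is only a sketch of what the tests above suggest, not Kafka's actual implementation: the class `PartitionLog` and its methods are hypothetical, and the figure of 26524 earlier messages is inferred from the lowest offset reported above.

```python
class PartitionLog:
    """Toy model of one partition: offsets only ever increase."""

    def __init__(self) -> None:
        self.earliest = 0     # offset of the oldest message still retained
        self.next_offset = 0  # offset the next appended message will receive

    def append(self, count: int) -> None:
        """Append `count` messages, each taking the next offset."""
        self.next_offset += count

    def delete_all(self) -> None:
        """Drop all retained messages; numbering is NOT reset to zero."""
        self.earliest = self.next_offset

    def available(self) -> int:
        """Number of messages a consumer could still read."""
        return self.next_offset - self.earliest


# Reused topic: earlier runs were written and cleared, then 3500 new messages sent.
log = PartitionLog()
log.append(26524)  # assumed total from earlier runs
log.delete_all()
log.append(3500)
print(log.earliest, log.next_offset, log.available())

# Fresh install: lowest offset is 0 and highest equals the number of messages sent.
fresh = PartitionLog()
fresh.append(3500)
print(fresh.earliest, fresh.next_offset)
```

In this model the lowest offset is the only safe starting point for a full replay, which matches the result of starting the consumer from 26524.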

At one point I got an answer on the Kafka mailing list that the offset was bytes and not messages, but I don't see how that can be correct: my messages are NOT all the same size, so a byte count would not explain the observed results.

As a final note, I found that passing 0 to the set command above still worked. I assume Kafka went looking for 0, then 1, then 2 and so on right up until it found an actual message at 26524. So if you really want everything from the beginning, being totally precise about the starting offset appears not to matter.

Upvotes: 4
