Reputation: 423
It's unclear to me (and I haven't managed to find any documentation that makes it perfectly clear) how compression affects kafka configurations that deal with bytes.
Take a hypothetical message that is exactly 100 bytes, a producer with a batch size of 1000 bytes, and a consumer with a fetch size of 1000 bytes.
With no compression it seems pretty clear that my producer would batch 10 messages at a time and my consumer would poll 10 messages at a time.
Now assume a compression (specified at the producer -- not on the broker) that (for simplicity) compresses to exactly 10% of the uncompressed size.
With that same config, would my producer still batch 10 messages at a time, or would it start batching 100 messages at a time? I.e. is the batch size pre- or post-compression? The docs do say this:
Compression is of full batches of data
...which I take to mean that it would compress 1000 bytes (the batch size) down to 100 bytes and send that. Is that correct?
Same question for the consumer fetch. Given a 1K fetch size, would it poll just 10 messages at a time (because the uncompressed size is 1K) or would it poll 100 messages (because the compressed size is 1K)? I believe that the fetch size will cover the compressed batch, in which case the consumer would be fetching 10 batches as-produced-by-the-producer at a time. Is this correct?
It seems confusing to me that, if I understand correctly, the producer is dealing with pre-compression sizes and the consumer is dealing with post-compression sizes.
Upvotes: 3
Views: 2518
Reputation: 8335
It's both simpler and more complicated ;-)
It's simpler in that both the producer and the consumer compresses and uncompresses the same Kafka Protocol Produce Requests and Fetch Requests and the broker just stores them with zero copy in their native wire format. Kafka does not compress individual messages before they are sent. It waits until a batch of messages (all going to the same partition) are ready for send and then compresses the entire batch and sends it as one Produce Request.
It's more complicated because you also have to factor in the linger time which will trigger a send of a batch of messages earlier than when the producer buffer size is full. You also have to consider that messages may have different keys, or for other reasons be going to different topic partitions on different brokers so it's not true to say that qty(10) records compressed to 100 bytes each go all as one batch to one broker as a single produce request of 1000 bytes (unless all the messages are being sent to a topic with a single partition).
From https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html
The producer maintains buffers of unsent records for each partition. These buffers are of a size specified by the batch.size config. Making this larger can result in more batching, but requires more memory (since we will generally have one of these buffers for each active partition).
By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0. This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch. This is analogous to Nagle's algorithm in TCP. For example, in the code snippet above, likely all 100 records would be sent in a single request since we set our linger time to 1 millisecond. However this setting would add 1 millisecond of latency to our request waiting for more records to arrive if we didn't fill up the buffer. Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.
Upvotes: 3