Reputation: 2495
There is a microservice which receives batches of messages from outside and pushes them to Kafka. Each message is sent separately, so for each batch I have around 1000 messages of 100 bytes each. It seems like the messages take much more space internally, because the free space on the disk is going down much faster than I expected.
I'm thinking about changing the producer logic so that it puts the whole batch into one message (the consumer will then split it back into individual messages by itself). But I haven't found any information about space or performance issues with many small messages, nor any guidelines about the balance between message size and count. And I don't know Kafka well enough to draw my own conclusion.
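For reference, this is roughly the change I have in mind: pack the whole batch into a single length-prefixed payload on the producer side and split it back on the consumer side. The class and method names are just my own sketch, not anything Kafka provides:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BatchCodec {

    // Producer side: pack many small messages into one length-prefixed payload.
    static byte[] pack(List<byte[]> messages) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(messages.size());          // how many messages follow
        for (byte[] m : messages) {
            out.writeInt(m.length);             // length prefix per message
            out.write(m);
        }
        return buf.toByteArray();
    }

    // Consumer side: split the payload back into the individual messages.
    static List<byte[]> unpack(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        int count = in.readInt();
        List<byte[]> messages = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            byte[] m = new byte[in.readInt()];
            in.readFully(m);
            messages.add(m);
        }
        return messages;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            batch.add(("message-" + i).getBytes(StandardCharsets.UTF_8));
        }
        byte[] payload = pack(batch);
        List<byte[]> roundTrip = unpack(payload);
        assert roundTrip.size() == 1000;
        assert new String(roundTrip.get(42), StandardCharsets.UTF_8).equals("message-42");
        System.out.println("payload bytes: " + payload.length);
    }
}
```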
Thank you.
Upvotes: 4
Views: 2394
Reputation: 140
Kafka is designed for high-throughput processing of large flows of short messages. Under the hood, Kafka sends requests over TCP/IP to the brokers, and consumers pull the messages. Remember that a TCP/IP packet is about 1.5 KB. Thus, when working with large messages, Kafka brokers have to allocate bigger memory buffers, merge message fragments, and track send/receive operations, which leads to performance degradation. The concrete numbers depend on the power of the nodes in the cluster and on network throughput, but usually degradation is felt above 20 KB; hundreds of KB are bad, and megabytes are killers.

Sending messages in batches does not mean sending huge messages. Once more: batching != large messages. A batch should be used when a set of small messages must be sent. For example, {"don't send large messages", "it kills performance"} are small messages and can be sent even in one TCP packet, so why send two packets? This does not mean, either, that every message must fit into a single TCP/IP packet, because send/pull requests and acknowledgements also require round trips between clients and the cluster, which spends the broker's time and resources. On the other hand, as already mentioned, accepting multi-packet messages also requires additional GC work and merging operations on the brokers. So we must consider multiple factors when selecting an optimal message (or batch) size, but we are definitely not talking about even dozens of KB.
Upvotes: 0
Reputation: 11850
The producer will, by itself, batch messages that are destined to the same partition, in order to avoid unnecessary calls. It does this thanks to its background sender thread, which groups several messages into one batch per partition before sending them.
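To illustrate, this automatic batching is controlled by two producer settings, `batch.size` and `linger.ms` (the config keys are real Kafka producer properties; the class name, broker address and values here are only illustrative):

```java
import java.util.Properties;

public class ProducerBatchingConfig {

    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Collect up to 32 KB of records per partition into one batch
        // before sending (illustrative value; the default is 16 KB).
        props.put("batch.size", "32768");
        // Wait up to 10 ms for more records to fill a batch before
        // flushing it, trading a little latency for bigger batches.
        props.put("linger.ms", "10");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("batch.size")); // prints "32768"
    }
}
```

With settings like these, 1000 messages of 100 bytes each would already be grouped into a handful of requests per partition, without changing the message format at all.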
If you also set compression on the producer side, it will compress the messages (GZip, LZ4 and Snappy are valid codecs) before sending them over the wire. This property can also be set on the broker side (so the messages are sent uncompressed by the producer and compressed by the broker).
It depends on your network capacity whether you prefer a slower producer (as compression will slow it down) or a bigger load on the wire. Note that using a high compression level on big payloads may significantly affect your overall performance.
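Producer-side compression is a single setting; since Kafka compresses whole batches, lots of similar small messages usually compress well. The config key is the real producer property, but the class name and codec choice are just for illustration:

```java
import java.util.Properties;

public class CompressionConfig {

    public static Properties build() {
        Properties props = new Properties();
        // Compress whole producer batches before they hit the wire.
        // Valid codecs include gzip, snappy and lz4 (newer clients also
        // support zstd); "none" disables compression.
        props.put("compression.type", "lz4");
        return props;
    }
}
```

The same `compression.type` key also exists as a topic/broker config, which is what the broker-side option above refers to.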
Anyway, I believe the big/small message problem hurts the consumer side a lot more. Sending messages to Kafka is easy and fast (the default behaviour is async, so the producer won't be too busy). But on the consumer side, you'll have to look at the way you are processing the messages:
Here you couple consuming with processing. This is the simplest way: the consumer sits in its own thread, reads a Kafka message and processes it, then continues the loop.
Here you decouple consuming and processing. In most cases, reading from kafka will be faster than the time you need to process the message. It is just physics. In this approach, one consumer feeds many separate worker threads that share the processing load.
More info about this here, just above the Constructors area.
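The decoupled approach can be sketched without any Kafka dependency at all; here plain strings stand in for the records a real poll() would return, and the class and method names are my own invention:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class DecoupledConsumer {

    // One "polling" thread hands records to a pool of workers, so the
    // slow processing never blocks the next poll() call.
    static int run(int records, int threads) throws InterruptedException {
        BlockingQueue<String> handoff = new LinkedBlockingQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(records);

        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        handoff.take();              // block until a record arrives
                        processed.incrementAndGet(); // stand-in for slow processing
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shutdownNow() interrupts us
                }
            });
        }

        // The "poll loop": in real code these strings would be the records
        // returned by consumer.poll(); here we just hand them off and go
        // straight back to polling, staying inside the rebalance timeout.
        for (int i = 0; i < records; i++) {
            handoff.put("record-" + i);
        }

        done.await();
        workers.shutdownNow();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed " + run(100, 4)); // prints "processed 100"
    }
}
```

In a real consumer the tricky part this sketch skips is offset management: you should only commit offsets for records the workers have actually finished.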
Why do I explain this? Well, if your messages are too big and you choose the first option, your consumer may not call poll() within the timeout interval, so it will rebalance continuously. If your messages are big (and take some time to be processed), better implement the second option, as the consumer will continue on its own, calling poll() without falling into rebalances.
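The poll/rebalance relationship is governed by two consumer settings (the keys are real Kafka consumer properties; the class name and values are illustrative):

```java
import java.util.Properties;

public class ConsumerPollConfig {

    public static Properties build() {
        Properties props = new Properties();
        // Fewer records per poll() means each loop iteration finishes
        // sooner, so the next poll() comes earlier (the default is 500).
        props.put("max.poll.records", "100");
        // Upper bound on the time between two poll() calls before the
        // consumer is considered dead and a rebalance is triggered
        // (300000 ms, i.e. 5 minutes, is the default).
        props.put("max.poll.interval.ms", "300000");
        return props;
    }
}
```

With the coupled approach, per-record processing time times `max.poll.records` must stay safely under `max.poll.interval.ms`; the decoupled approach removes that constraint from the polling thread.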
If the messages are too big and too many, you may have to start thinking about different structures that can buffer the messages in your memory. Pools, deques and queues, for example, are different options to accomplish this.
You may also increase the poll timeout interval, but this may hide dead consumers from you, so I don't really recommend it.
So my answer would be: it depends, basically, on your network capacity, your required latency and your processing capacity. If you are able to process big messages as fast as smaller ones, then I wouldn't care much.
Maybe if you need to filter and reprocess older messages I'd recommend partitioning the topics and sending smaller messages, but that's only one use case.
Upvotes: 4