Reputation: 193
If I have an application that publishes events on a Kafka topic and my consumers need to read the data in the order it was published, then my topic can have only one partition, since Kafka guarantees ordering only within a partition.
However, I read that Kafka uses partitioning to provide scalability, i.e. by placing the partitions of a topic on several brokers. I also read that a partition itself cannot be split.
Since ordering is only possible within a partition, is scalability a problem for my application? Is there a way to deal with this problem or is my understanding of Kafka not right?
Imagine my application has thousands of consumers (each in a single group so everyone consumes the published events). All need to read data from that single topic with that single partition.
EDIT: Another thing that comes to mind: imagine having 5 partitions of that topic, and all consumers must still read the events in the right order. If the publishers don't specify a partition ID or a key, then Kafka will publish the messages round-robin across the 5 partitions, right?
If all consumers are in a single group and all subscribe to the topic, then each consumer reads events of all topics, which means that they would still get the ordered messages, right?
Upvotes: 1
Views: 1257
Reputation: 512
Point 1) If your requirement is to process all records strictly in sequence, then it is not possible with parallel processing, because parallel processing never guarantees the sequence.
Point 2) Yes, in Kafka ordering is guaranteed only within a partition, and with the default partitioner all records sent with the same key go to the same partition. So analyse your data to see whether the related records that truly require in-order processing can be separated out. Send those related records with the same key, and send other groups of related records with a different key.
Point 3) If you are able to separate your data by key, then you will have to increase the number of partitions, and the number of consumers accordingly. For example, if you have 3 partitions, you can scale your application to 3 consumers in the same group (note that you are producing records with keys to preserve your ordering). Each of the 3 consumers is assigned 1 partition, and parallel processing is achieved. (This only guarantees in-order processing of records with the same key.)
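To make points 2) and 3) concrete, here is a minimal in-memory sketch of key-based partitioning. It does not talk to a Kafka cluster: `partition_for`, `produce` and `values_for_key` are hypothetical helpers, the `order-*` keys are made up, and CRC32 stands in for the hash Kafka's default partitioner uses (murmur2). The only property it relies on is that the hash is stable, so a given key always maps to the same partition.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count, matching the 3-consumer example


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(records, num_partitions: int = NUM_PARTITIONS):
    """Append (key, value) records to in-memory partition logs in send order."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def values_for_key(partitions, key):
    """Read back the values of one key in the order its partition stores them."""
    log = partitions[partition_for(key)]
    return [value for k, value in log if k == key]


records = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-3", "created"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]
partitions = produce(records)
print(values_for_key(partitions, "order-1"))  # ['created', 'paid', 'shipped']
```

Which partition each key lands on depends on the hash, but the events of any one key always come back in the order they were sent. Nothing is implied about the relative order of events with different keys.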
Point 4)
Imagine my application has thousands of consumers (each in a single group so everyone consumes the published events). All need to read data from that single topic with that single partition.
If all of your (thousands of) consumers are in the same group and read from a single-partition topic, then only one consumer will be assigned that partition, and all the remaining (thousands - 1) consumers will sit idle doing nothing.
If you assign a different group to each consumer, then every consumer will be assigned that single partition and each consumer will process all records on its own, so there will be duplicate processing (every record is processed once per group).
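The two cases in point 4) can be sketched with a simplified model of partition assignment. This is not Kafka's actual assignor, only an illustration of the rule that within one group each partition is owned by exactly one consumer; `assign` and the `consumer-N` names are hypothetical.

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of ONE group, round-robin style.

    Each partition goes to exactly one consumer in the group; consumers
    beyond the partition count get nothing.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


topic_partitions = [0]  # a single-partition topic
consumers = [f"consumer-{n}" for n in range(1000)]

# Case A: all consumers share one group.
same_group = assign(topic_partitions, consumers)
active = [c for c, parts in same_group.items() if parts]
idle = [c for c, parts in same_group.items() if not parts]
print(len(active), len(idle))  # 1 999

# Case B: every consumer has its own group, so each group is assigned independently.
own_groups = {c: assign(topic_partitions, [c])[c] for c in consumers}
print(all(parts == [0] for parts in own_groups.values()))  # True
```

In case A only one consumer does any work. In case B every consumer reads the whole partition, in order, which is what you want if everyone really must see every event.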
Point 5)
If all consumers are in a single group and all subscribe to the topic, then each consumer reads events of all topics, which means that they would still get the ordered messages, right?
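No. As described in point 4), within a single group each partition is assigned to only one consumer, so each consumer reads only its own partitions and not all events. There is no guarantee that all records will be processed in order, because they are being processed by different consumers.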
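A small simulation of the 5-partition case from the question, as a sketch only. It assumes strict round-robin placement of key-less records, as the question does, and models partitions as plain lists. The lone consumer at the end drains one partition after another, which is just one possible interleaving; a real consumer gives no ordering guarantee across partitions.

```python
NUM_PARTITIONS = 5

events = list(range(10))  # publish order: 0, 1, 2, ...

# Round-robin producer: event i lands on partition i % 5.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for i, event in enumerate(events):
    partitions[i % NUM_PARTITIONS].append(event)

# One group with 5 consumers: consumer n owns partition n only.
seen_by = {f"consumer-{n}": partitions[n] for n in range(NUM_PARTITIONS)}
print(seen_by["consumer-0"])  # [0, 5]

# A lone consumer that owns all partitions and drains them one at a time.
drained = [event for partition in partitions for event in partition]
print(drained)            # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
print(drained == events)  # False
```

Each consumer in the group sees an ordered slice of the stream, but nobody sees the full publish order, and even a single consumer reading all 5 partitions cannot rebuild it from partition order alone.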
Summary: If you can group the records that actually require sequencing and send them with the same key, then ordering is guaranteed for those records. If your requirement is to consume all records strictly in sequence, then it is a sequential-processing problem, and parallel processing cannot be achieved here.
Upvotes: 2