Reputation: 2110
I am new to Kafka and just doing some tweaks on my local machine following the docs. Let's say I have 3 topics: T1, T2 and T3. T1 has 2 partitions, T2 has 3 partitions, and T3 has 5 partitions. I also have two brokers, B1 and B2.
Will Kafka manage assigning brokers to topics/partitions automatically? If yes, how?
Upvotes: 1
Views: 1751
Reputation: 18495
According to the Kafka documentation on Replica Management, partitions are assigned to brokers in a round-robin fashion:
We attempt to balance partitions within a cluster in a round-robin fashion to avoid clustering all partitions for high-volume topics on a small number of nodes. Likewise we try to balance leadership so that each node is the leader for a proportional share of its partitions.
Over time, after adding or removing topics and brokers, this will usually lead to imbalances and inefficiencies that you, as the operator of the platform, should take care of. Note also that partitions are not assigned to brokers based on any information such as data volume or the number of read/write accesses.
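The round-robin behaviour can be sketched in a few lines of Python. This is a simplified model for illustration only, not Kafka's actual assignment code (Kafka also randomizes the starting broker per topic and shifts replica placement to spread leadership); it uses the topic/broker setup from the question:

```python
# Simplified sketch of round-robin partition assignment.
# NOT Kafka's actual implementation: Kafka randomizes the starting
# broker and also places replicas, which this toy model ignores.
def assign_partitions(topics, brokers):
    """topics: dict mapping topic name -> partition count."""
    assignment = {b: [] for b in brokers}
    i = 0
    for topic, num_partitions in topics.items():
        for p in range(num_partitions):
            broker = brokers[i % len(brokers)]  # next broker in the ring
            assignment[broker].append((topic, p))
            i += 1
    return assignment

# The setup from the question: 3 topics, 2 brokers.
assignment = assign_partitions({"T1": 2, "T2": 3, "T3": 5}, ["B1", "B2"])
for broker, parts in assignment.items():
    print(broker, parts)
```

With 10 partitions over 2 brokers, each broker ends up with 5 partitions, which is the balance the round-robin scheme is after.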
To take care of these imbalances, Kafka ships with the command-line tool kafka-reassign-partitions.sh. An example can be found in the Kafka documentation under "Automatically migrating data to new machines". The licensed Confluent Platform also offers the Auto Data Balancer.
Upvotes: 0
Reputation: 39820
Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), and each message within a partition gets an incremental id, known as its offset, as shown below.
Partition 0:
+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+
Partition 1:
+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
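The offset scheme above can be modelled with a plain list per partition — a toy model only, since real Kafka persists the log to disk:

```python
# Toy model of a partition log: appending a record assigns the next
# incremental offset. Illustration only; real Kafka logs live on disk.
class Partition:
    def __init__(self):
        self.log = []

    def append(self, record):
        offset = len(self.log)  # next incremental id within this partition
        self.log.append(record)
        return offset

p0 = Partition()
print(p0.append("first"))   # offset 0
print(p0.append("second"))  # offset 1
```

Note that offsets are only meaningful within a single partition; partition 0 and partition 1 each start counting from 0 independently.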
Now a Kafka cluster is composed of multiple brokers. Each broker is identified with an ID and can contain certain topic partitions.
Example of 2 topics (each having 3 and 2 partitions respectively):
Broker 1:
+-------------------+
| Topic 1           |
| Partition 0       |
|                   |
|                   |
| Topic 2           |
| Partition 1       |
+-------------------+
Broker 2:
+-------------------+
| Topic 1           |
| Partition 2       |
|                   |
|                   |
| Topic 2           |
| Partition 0       |
+-------------------+
Broker 3:
+-------------------+
| Topic 1           |
| Partition 1       |
|                   |
|                   |
|                   |
|                   |
+-------------------+
Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).
Topics should have a replication-factor greater than 1 (usually 2 or 3) so that when a broker is down, another one can serve the data of the topic. For instance, assume a topic with 2 partitions and a replication-factor set to 2, as shown below:
Broker 1:
+-------------------+
| Topic 1           |
| Partition 0       |
|                   |
|                   |
|                   |
|                   |
+-------------------+
Broker 2:
+-------------------+
| Topic 1           |
| Partition 0       |
|                   |
|                   |
| Topic 1           |
| Partition 1       |
+-------------------+
Broker 3:
+-------------------+
| Topic 1           |
| Partition 1       |
|                   |
|                   |
|                   |
|                   |
+-------------------+
Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. A replication-factor of 3 is always a good idea, since it allows one broker to be taken down for maintenance and another one to fail unexpectedly at the same time. Therefore, Apache Kafka offers strong durability and fault-tolerance guarantees.
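The failover behaviour can be illustrated with a toy model of replica sets — this is only a sketch, not Kafka's actual controller/leader-election logic, and the broker and topic names are placeholders:

```python
# Toy model of leader failover: each partition has a replica set;
# if the leader's broker fails, a surviving replica takes over.
# Illustration only, not Kafka's actual controller logic.
replicas = {
    ("topic1", 0): ["broker1", "broker2"],  # replication-factor 2
    ("topic1", 1): ["broker3", "broker2"],
}

def leader_after_failure(replica_list, failed_brokers):
    alive = [b for b in replica_list if b not in failed_brokers]
    return alive[0] if alive else -1  # -1: no live replica left

failed = {"broker2"}
for partition, replica_list in replicas.items():
    print(partition, "->", leader_after_failure(replica_list, failed))
```

With broker2 down, partition 0 is still served by broker1 and partition 1 by broker3, matching the scenario described above.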
Note about Leaders:
At any time, only one broker can be the leader of a partition, and only that leader can receive and serve the data of that partition. The remaining brokers just synchronize the data (in-sync replicas). Also note that when the replication-factor is set to 1, the leader cannot be moved elsewhere when a broker fails. In general, when all replicas of a partition fail or go offline, the leader is automatically set to -1.
Note about the retention period:
If you are planning to use Kafka as storage, you also need to be aware of the configurable retention period of every topic. If you don't take care of this setting, you might lose your data. According to the docs:
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space.
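Time-based retention can be illustrated by filtering out records older than the retention period — a toy model only, not Kafka's actual log cleaner, using the two-day example from the quote above:

```python
# Toy illustration of time-based retention: records older than the
# retention period are eligible for deletion. Not Kafka's log cleaner.
RETENTION_MS = 2 * 24 * 60 * 60 * 1000  # two days, as in the docs example

def retained(records, now_ms, retention_ms=RETENTION_MS):
    return [r for r in records if now_ms - r["timestamp"] <= retention_ms]

now = 1_000_000_000_000
records = [
    {"value": "old", "timestamp": now - 3 * 24 * 60 * 60 * 1000},
    {"value": "recent", "timestamp": now - 60 * 1000},
]
print(retained(records, now))  # only the "recent" record survives
```

In real Kafka this is controlled per topic (e.g. the retention.ms topic configuration) and deletion happens segment by segment, not record by record.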
Upvotes: 2