Reputation: 383
Balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
Will that affect rack awareness?
For example, I have three machines placed in two racks, and data is placed according to rack awareness.
What would happen if I add a new machine to the cluster and run the balancer command?
Upvotes: 0
Views: 1008
Reputation: 682
If your question is how load balancing is used: load balancing helps spread the load evenly across the free nodes when a node is loaded above its threshold level.
A cluster is considered balanced if, for each data node, the ratio of used space at the node to the total capacity of the node (known as the utilization of the node) differs from the ratio of used space at the cluster to the total capacity of the cluster (the utilization of the cluster) by no more than the threshold value.
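To make that definition concrete, here is a minimal Python sketch of the check. It is not the balancer's actual code: the function names, the node names and the byte counts are made up for illustration, and the threshold is written as a fraction (0.10 meaning ten percentage points).

```python
def utilization(used, capacity):
    """Fraction of capacity that is in use (0.0 to 1.0)."""
    return used / capacity


def is_balanced(nodes, threshold):
    """Return True if every node's utilization is within `threshold`
    of the cluster-wide utilization.

    nodes: dict mapping a node name to (used, capacity), hypothetical figures.
    threshold: allowed difference, expressed as a fraction.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster_util = utilization(total_used, total_capacity)
    return all(
        abs(utilization(used, cap) - cluster_util) <= threshold
        for used, cap in nodes.values()
    )


# Hypothetical cluster: three nodes in use plus one newly added, empty node.
nodes = {
    "dn1": (60, 100),
    "dn2": (55, 100),
    "dn3": (65, 100),
    "dn4": (0, 100),
}
print(is_balanced(nodes, threshold=0.10))
```

With the empty `dn4` included, the cluster utilization drops to 0.45, so every node ends up more than 0.10 away from it and the check fails. That is the situation in which running the balancer after adding a machine has work to do.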
When you apply load balancing at runtime, it is called dynamic load balancing, and it can be realized in either a direct or an iterative manner, depending on how the execution node is selected.
Rack Awareness
Rack awareness prevents data loss when an entire rack fails and lets reads of a file use bandwidth from multiple racks.
On a multi-rack cluster, block replicas are maintained with a policy that no more than one replica is placed on any one node and no more than two replicas are placed in the same rack, with the constraint that the number of racks used for block replication should always be less than the total number of block replicas.
For example,
It minimizes the write cost and maximizes read speed.
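The placement rules above can be written down as a small check. This is only a sketch that encodes the three rules literally as stated in this answer; it is not HDFS's own placement code, and the function name, node names and rack map are hypothetical.

```python
from collections import Counter


def follows_placement_policy(replica_nodes, rack_of):
    """Check one block's replica locations against the stated rules.

    replica_nodes: list of node names holding the block's replicas.
    rack_of: dict mapping a node name to its rack (hypothetical topology).
    """
    per_node = Counter(replica_nodes)
    per_rack = Counter(rack_of[node] for node in replica_nodes)

    one_per_node = all(count <= 1 for count in per_node.values())
    two_per_rack = all(count <= 2 for count in per_rack.values())
    # Number of racks used must be less than the number of replicas.
    fewer_racks_than_replicas = len(per_rack) < len(replica_nodes)

    return one_per_node and two_per_rack and fewer_racks_than_replicas


# Hypothetical topology: the asker's three machines in two racks,
# plus a newly added machine dn4.
rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

print(follows_placement_policy(["dn1", "dn2", "dn3"], rack_of))
print(follows_placement_policy(["dn1", "dn3", "dn4"], rack_of))
print(follows_placement_policy(["dn1", "dn1", "dn3"], rack_of))
```

The first two placements satisfy all three rules (two racks, three replicas, at most two per rack). The third puts two replicas on `dn1`, so it fails the one-replica-per-node rule.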
Upvotes: 0
Reputation: 191681
Rack awareness and data locality are YARN concepts. The HDFS balancer only cares about leveling out the DataNode usage.
If you have 3 machines, with 3 replicas by default, then every machine could be guaranteed to have 1 replica, therefore with 2 racks, you're practically guaranteed to have rack locality.
Node locality is more performant than rack awareness, anyway.
If you have 10 GB intra-cluster speeds between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data-local processing is not available.
Upvotes: 1