Reputation: 1339
Sometimes data blocks are stored in an imbalanced way across the DataNodes. Under the default HDFS block placement policy, the first replica is preferably stored on the writer node (i.e. the client node, when it is itself a DataNode), the second replica is stored on a node in a remote rack, and the third is stored on a different node in that same remote rack. What use cases cause data blocks to become unbalanced across the DataNodes under this placement policy? One possible reason I have in mind: if there are only a few writer nodes, one replica of every data block will be stored on those nodes. Are there any other reasons?
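To make the "few writer nodes" case concrete, here is a rough simulation sketch. It is not HDFS code: `simulate_placement`, the rack layout, and the node names are all made up for illustration, and it assumes a simplified version of the default policy (first replica on the writer, second on a random node in another rack, third on a different node in that same remote rack), ignoring free space, load, and failures.

```python
import random
from collections import Counter

def simulate_placement(racks, writers, num_blocks, seed=0):
    """Count replicas per node under a simplified default placement policy.

    racks:   dict mapping rack name -> list of node names (hypothetical layout)
    writers: list of nodes that act as clients; each must be in some rack
    Assumes at least two racks and at least two nodes per rack.
    """
    rng = random.Random(seed)
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    counts = Counter({node: 0 for node in node_rack})
    for _ in range(num_blocks):
        # Replica 1: the writer's own node.
        first = rng.choice(writers)
        # Replica 2: a random node in a different rack.
        remote_rack = rng.choice([r for r in racks if r != node_rack[first]])
        second = rng.choice(racks[remote_rack])
        # Replica 3: a different node in the same remote rack.
        third = rng.choice([n for n in racks[remote_rack] if n != second])
        for node in (first, second, third):
            counts[node] += 1
    return counts

# Hypothetical 2-rack cluster with a single writer node "a1".
racks = {"rack1": ["a1", "a2", "a3"], "rack2": ["b1", "b2", "b3"]}
counts = simulate_placement(racks, writers=["a1"], num_blocks=1000)
for node in sorted(counts):
    print(node, counts[node])
```

With a single writer, `a1` ends up holding one replica of every block while `a2` and `a3` hold none, which is the skew described above.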
Upvotes: 2
Views: 1196
Reputation:
Here are some potential reasons for data skew:
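The `hdfs balancer` command allows admins to rebalance the cluster. Also, https://issues.apache.org/jira/browse/HDFS-1804 added a new volume-choosing policy that takes into account the free space left on each volume when a DataNode picks a disk for a new block. Note that this addresses skew between the disks of a single DataNode rather than skew across DataNodes.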
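As a rough illustration of what the balancer's threshold means (e.g. `hdfs balancer -threshold 10`, where 10 percent is the default), the sketch below labels a node by how far its utilization is from the cluster average. This is a simplified model and not the balancer's actual implementation; `classify_nodes` and the sample numbers are hypothetical.

```python
def classify_nodes(used, capacity, threshold_pct=10.0):
    """Label each node relative to the cluster-wide average utilization.

    used, capacity: dicts mapping node name -> bytes (or any common unit).
    A node counts as balanced when its utilization is within
    threshold_pct percentage points of the cluster average.
    """
    avg_util = 100.0 * sum(used.values()) / sum(capacity.values())
    labels = {}
    for node in used:
        util = 100.0 * used[node] / capacity[node]
        if util > avg_util + threshold_pct:
            labels[node] = "over-utilized"
        elif util < avg_util - threshold_pct:
            labels[node] = "under-utilized"
        else:
            labels[node] = "balanced"
    return labels

# Hypothetical three-node cluster, 100 units of capacity each.
used = {"dn1": 90, "dn2": 50, "dn3": 10}
capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
labels = classify_nodes(used, capacity)
print(labels)
```

In this model the balancer would move blocks from `dn1` toward `dn3` until every node is within the threshold of the average.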
Upvotes: 5