Reputation: 413
Suppose I have a Hadoop cluster in which each data block is set to have 3 copies (replication factor 3).
One day, a datanode is unplugged (assume the data stored on it is intact). HDFS then creates new copies of the blocks that were stored on that node, so each block still has 3 copies. But if the datanode is plugged back in the next day, some blocks end up with 4 copies, and HDFS has to delete 1 of the 4.
My question is: how does HDFS choose which copy to delete? Randomly? Or does it just delete the newest one (which would mean the returning datanode gets cleared)?
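For reference, this is how I look at the current replica count and the datanodes holding each block, using the standard Hadoop FileSystem API (a minimal sketch; the file path is just an example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReplicaInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/data.txt"); // example path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Target replication factor: " + status.getReplication());

        // List each block of the file and the datanodes that currently hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " has replicas on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```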
Upvotes: 1
Views: 105
Reputation: 35444
Question: But if the Datanode is repaired and starts to work again, some data blocks have 4 copies, then HDFS has to delete 1 of the 4 copies
As you mentioned, in HDFS, when a datanode goes offline, the NameNode re-creates the lost copies on other nodes to maintain the proper replication factor for the blocks.
Now, if we want to include the same (or a different) node in HDFS again, we format it
and then add it to the cluster. So there will not be over-replicated blocks in the cluster at any point in time.
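As a side note, you can watch the same over-replication handling without touching any datanode: if you lower a file's target replication (for example via FileSystem.setReplication, which is effectively what hdfs dfs -setrep does), the NameNode marks the excess replicas for deletion on its own. A minimal sketch, assuming an example path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/data.txt"); // example path

        // Raise the target: the NameNode schedules extra copies on other datanodes.
        fs.setReplication(file, (short) 4);

        // Lower it back: the now-excess replicas are marked for deletion by the
        // NameNode, the same over-replication handling that trims surplus copies
        // when a datanode rejoins the cluster.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```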
Upvotes: 1
Reputation: 895
The data in a datanode is wiped out when it crashes. That is why HDFS keeps replicas: to ensure data availability is maintained in case of a datanode failure.
Upvotes: 0