Reputation: 15
I have 5 TB of data, the combined storage capacity of the whole cluster is 7 TB, and I have set the replication factor to 2.
In this case, how will the data be replicated?
Because of the replication factor, the minimum storage capacity of the cluster (nodes) always has to be at least double the size of the data. Do you think this is a drawback of Hadoop?
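To make the arithmetic concrete, here is a quick back-of-the-envelope check using the numbers from the question (5 TB of data, 7 TB of raw capacity, replication factor 2):

```python
# Back-of-the-envelope capacity check for the scenario in the question.
data_tb = 5       # logical size of the data
cluster_tb = 7    # total raw storage across all nodes
replication = 2   # dfs.replication

required_tb = data_tb * replication      # raw storage needed for full replication
shortfall_tb = required_tb - cluster_tb  # capacity missing for the second copies

print(f"required: {required_tb} TB, available: {cluster_tb} TB, "
      f"shortfall: {shortfall_tb} TB")
# 10 TB is required but only 7 TB is available, so roughly 3 TB worth
# of blocks cannot get a second replica and stay under-replicated.
```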
Upvotes: 1
Views: 6580
Reputation: 49
Replication of the data is not a drawback of Hadoop -- it is one of the factors that makes Hadoop (HDFS) effective. Replicating data across a larger number of slave nodes gives the cluster high availability and good fault tolerance. If you consider the losses a client incurs when nodes in the cluster go down (typically millions of dollars), the cost of the extra storage needed to replicate the data is much smaller. So the replication of data is justified.
Upvotes: 1
Reputation: 11721
If the storage on your cluster is not at least double the size of your data, you will end up with under-replicated blocks. Under-replicated blocks are those with fewer replicas than the replication factor, so if your replication factor is 2, some of your blocks will have a replication factor of 1.
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored in it and if you do not replicate your data, then a part of your data will not be available due to the node failure. However, if your data is replicated, the data which was on the node which went down will still be accessible to you from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor = 1.
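For completeness, a sketch of how that is typically done: the cluster-wide default lives in `hdfs-site.xml` under the standard `dfs.replication` property (the snippet below is a minimal example, not a full config file).

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

You can also change the replication factor of files already in HDFS with `hdfs dfs -setrep -w 1 <path>` (`-w` waits for the replication to complete; the path is whatever directory or file you want to change).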
Upvotes: 4
Reputation:
This is a case of under-replication. Assume you have 5 blocks. Because of the space constraint, HDFS was able to create replicas for only the first 3 blocks, so the other two blocks are under-replicated. When HDFS finds sufficient space, it will replicate those 2 blocks as well.
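That behaviour can be sketched with a tiny simulation (greedy placement under a space constraint; the 5-block/8-slot numbers are the hypothetical figures from this answer, not anything HDFS guarantees):

```python
# Simulate replica placement when raw capacity runs short.
# 5 blocks at replication factor 2 want 10 replicas in total,
# but the cluster only has room for 8.
blocks = 5
replication = 2
capacity_slots = 8  # free block slots across the cluster

replicas = [0] * blocks
# First place one copy of every block, then add second copies
# while space lasts, so the earlier blocks fill up first.
for target in range(1, replication + 1):
    for b in range(blocks):
        if sum(replicas) < capacity_slots and replicas[b] < target:
            replicas[b] += 1

under_replicated = [b for b in range(blocks) if replicas[b] < replication]
print("replica counts:", replicas)          # [2, 2, 2, 1, 1]
print("under-replicated blocks:", under_replicated)  # [3, 4]
```

The first 3 blocks end up fully replicated and the last 2 sit at a single copy until more capacity appears, matching the description above.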
Upvotes: 0