Reputation: 59
I was attending a course on Hadoop and MapReduce on Udacity.com, and the instructor mentioned that in HDFS, to reduce the points of failure, each block is replicated 3 times across the cluster. Is that really true?? Does it mean that if I have 1 petabyte of logs I will need 3 petabytes of storage?? Because that will cost me more.
Upvotes: 0
Views: 1179
Reputation: 398
By default, the HDFS configuration parameter dfs.replication
is set to 3. That provides fault tolerance, availability, etc... (All HDFS parameters are documented here)
But at install time, you can set the parameter to 1, and HDFS won't make replicas of your data. With dfs.replication=1, 1 petabyte is stored in the same amount of space.
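As a sketch, the cluster-wide default would be set in hdfs-site.xml; the property name dfs.replication is standard, and the value 1 here is just for illustration:

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<property>
  <name>dfs.replication</name>
  <!-- 1 = no extra copies; the usual default is 3 -->
  <value>1</value>
</property>
```

Note that with replication 1, losing a single datanode disk means losing that data.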
Upvotes: 1
Reputation: 1809
This is because HDFS replicates data when you store it. The default replication factor for HDFS is 3, which you can find in the hdfs-site.xml file under the dfs.replication property. You can set this value to 1 or 5 as per your requirement.
Data replication is very useful: if a particular node goes down, you will have a copy of the data available on another node/nodes for processing.
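Besides the cluster-wide default, the replication factor can also be changed per file after the fact with the standard hdfs dfs -setrep command (the path below is just an example):

```shell
# Set replication to 2 for an existing file and wait for it to complete
hdfs dfs -setrep -w 2 /logs/app.log

# Check the current replication factor of that file
hdfs dfs -stat "%r" /logs/app.log
```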
Upvotes: 0
Reputation: 294407
Yes, it is true: HDFS requires space for each redundant copy, and it needs those copies to achieve fault tolerance and data locality during processing.
But this is not necessarily true of MapReduce, which can run on other file systems like S3 or Azure blobs, for instance. It is HDFS that requires the 3 copies.
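The storage overhead in the question is simply the replication factor times the logical data size; a minimal sketch of the arithmetic:

```python
def raw_storage_needed(logical_bytes: int, replication: int = 3) -> int:
    """Raw HDFS capacity needed to hold `logical_bytes` of data
    when every block is stored `replication` times."""
    return logical_bytes * replication

PETABYTE = 10 ** 15
# 1 PB of logs with the default replication factor of 3:
print(raw_storage_needed(1 * PETABYTE) // PETABYTE)  # → 3
```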
Upvotes: 1
Reputation: 37083
Yes, that's true. Say you have 4 machines with datanodes running on them; then by default each block will also be replicated to two other machines chosen at random. If you don't want that, you can switch it to 1 by setting the dfs.replication
property in hdfs-site.xml.
Upvotes: 0