Reputation: 2590
I hope we can get some advice from the smart people here.
We have a Hadoop cluster with 5 DataNode machines (worker machines). Our HDFS capacity is almost 80 TB, and we are at 98% used capacity!
For budget reasons we can't increase the HDFS capacity by adding disks to the DataNodes, so we are thinking of decreasing the HDFS replication factor from 3 to 2.
Let's run through a simulation: if we decrease the HDFS replication factor from 3 to 2, it means only 2 copies of each block are kept.
But here is the question: the third replica that was created under the previous replication factor of 3 still exists on the HDFS disks.
So how does HDFS know to delete that third replica? Is that something HDFS handles automatically?
Or is there no way to delete the old replicas that were created under the previous replication factor?
Upvotes: 1
Views: 1159
Reputation: 595
In general, 3 is the recommended replication factor. If you need to change it, though, there's a command to change the replication factor of existing files in HDFS:
hdfs dfs -setrep -w <REPLICATION_FACTOR> <PATH>
The path can be a file or directory. So, to change the replication factor of all existing files from 3 to 2 you could use:
hdfs dfs -setrep -w 2 /
Note that -w will force the command to wait until the replication has changed for all files. With terabytes of data this will take a while.
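The -w flag is optional, so if you'd rather not block, something like the following should also work; the NameNode will then delete the now excess third replicas in the background, and the used capacity drops gradually as those deletions are processed:
hdfs dfs -setrep 2 /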
To check that the replication factor has changed, you can use hdfs fsck / and have a look at "Average block replication". It should have changed from 3 to 2.
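For example, to pull out just that line from the fsck summary (assuming your Hadoop version prints an "Average block replication" line, as described above):
hdfs fsck / | grep -i "average block replication"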
Have a look at the command's docs for more details.
You can change the default replication factor, which will be used for new files, by updating hdfs-site.xml.
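As a sketch, that means setting the dfs.replication property in hdfs-site.xml (this only affects files created after the change; existing files keep their current factor until you run -setrep on them):
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>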
Upvotes: 1