CodeReaper

Reputation: 387

Hadoop Optimization Suggestion

Consider a scenario: if I increase the replication factor of the data I have in HDFS (suppose in a 10-node cluster I set the RF to 5 instead of the default 3), will it improve the performance of my data processing tasks?

Will the map phase complete sooner compared to the default replication setting?

Will there be any effect on the reduce phase?
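For context, this is roughly how I would raise the replication factor on the existing data (a rough sketch using the FileSystem API; the path is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor of an existing file from 3 to 5.
            // The path below is just an example.
            fs.setReplication(new Path("/data/input/part-00000"), (short) 5);

            // Files written later pick up dfs.replication from the client
            // configuration, so that would also need to be set to 5.
            fs.close();
        }
    }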

Upvotes: 0

Views: 396

Answers (2)

Manjunath Ballur

Reputation: 6343

Impact of Replication on Storage:

  • The replication factor has a huge impact on the storage of the cluster: the larger the replication factor, the less data you can store in the cluster.
  • If the replication factor is 5, then for every 1 GB of data ingested into the cluster you will need 5 GB of storage space, and you will quickly run out of space in the cluster (see the arithmetic sketch after this list).
  • Since the NameNode stores all the meta information in memory, it will also run out of space to store the metadata sooner. Hence, your NameNode will have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
  • The data copy operation will take more time, since the write pipeline is daisy-chained across DataNodes. Instead of 3 DataNodes, now 5 DataNodes will have to confirm data storage before a write/append is committed.
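To make the storage overhead concrete, here is a quick arithmetic sketch (the dataset size is only an example):

    public class ReplicationFootprint {
        public static void main(String[] args) {
            long ingestedGb = 100;   // example: 100 GB of logical data
            int defaultRf = 3;
            int proposedRf = 5;

            // Raw HDFS space consumed is roughly logical size * replication factor.
            System.out.println("Raw space at RF=3: " + ingestedGb * defaultRf + " GB");
            System.out.println("Raw space at RF=5: " + ingestedGb * proposedRf + " GB");
            System.out.println("Extra space needed: "
                    + ingestedGb * (proposedRf - defaultRf) + " GB");
        }
    }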

Impact of Replication on Computation:

Mapper:

  • With a higher replication factor, there are more options for scheduling a mapper. With a replication factor of 3, you can schedule a mapper on 3 different nodes; with a factor of 5, you have 5 choices.
  • You may be able to achieve better data locality with an increase in the replication factor. Each mapper could get scheduled on a node where its data is present (since there are now 5 choices compared to the default 3), thus improving performance.
  • Since there is better data locality, fewer mappers will have to copy off-node or off-rack data.

Due to these reasons, it's possible that, with a higher replication factor, the mappers complete earlier than with a lower replication factor.

Since the number of mappers is typically much higher than the number of reducers, you may see an overall improvement in your job performance.
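If you want to verify whether locality actually improves after raising the replication factor, one option is to compare the job's locality counters before and after the change. A minimal sketch, assuming the standard Cluster/JobCounter APIs (the job ID is passed in, since it is specific to your cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.JobID;

    public class LocalityCheck {
        public static void main(String[] args) throws Exception {
            Cluster cluster = new Cluster(new Configuration());

            // args[0] is the ID of a completed job, e.g. "job_1234567890123_0001".
            Job job = cluster.getJob(JobID.forName(args[0]));
            Counters counters = job.getCounters();

            long dataLocal  = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal  = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            long otherLocal = counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue();

            // With better locality, DATA_LOCAL_MAPS should grow relative to the others.
            System.out.println("Data-local maps:  " + dataLocal);
            System.out.println("Rack-local maps:  " + rackLocal);
            System.out.println("Other-local maps: " + otherLocal);
            cluster.close();
        }
    }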

Reducer:

  • Since the output of the reducers is written directly into HDFS, it's possible that your reducers will take more time to execute with a higher replication factor, because each output block now has to be acknowledged by more DataNodes. One common mitigation is sketched below.
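If the extra write cost matters, one common workaround is to keep the cluster-wide default but write the job's output with a lower replication factor, by setting dfs.replication in the job configuration. A minimal driver sketch; the mapper/reducer setup is omitted and the paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LowReplicationOutputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Files written by this job (the reducer output) get replication 2,
            // regardless of the cluster-wide default.
            conf.setInt("dfs.replication", 2);

            Job job = Job.getInstance(conf, "low-replication-output");
            job.setJarByClass(LowReplicationOutputJob.class);
            // Mapper, reducer and key/value classes would be set here as usual.

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }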

Overall, your mappers may execute faster with a higher replication factor, but the actual performance improvement depends on various factors like the size of your cluster, network bandwidth, NameNode memory, etc.

After answering this question, I came across another similar question on SO: Map Job Performance on cluster. That question contains some more information, with links to various research papers.

Upvotes: 1

Okezie

Reputation: 5118

Setting the replication factor to 5 will cause the HDFS NameNode to maintain 5 copies of each file block on the available DataNodes in the cluster. This replication, coordinated by the NameNode and carried out by the DataNodes, will result in higher network bandwidth usage, depending on the size of the files to be replicated and the speed of your network.

The replication factor has no direct effect on either the map or the reduce phase. You may see a performance hit initially, while blocks are still being replicated, if you run a map-reduce job at the same time; this could cause significant network latency depending on the size of the files and your network bandwidth.

A replication factor of 5 across your cluster means that 4 of your DataNodes can disappear from the cluster and you will still have access to all files in HDFS with no file corruption or missing blocks. If your RF = 4, then you can lose 3 servers and still have access to all files in HDFS.

Setting a higher replication factor increases your overall HDFS usage: if your total data size is 1 TB, an RF of 3 means your HDFS usage will be 3 TB, since each block is stored 3 times (the original plus 2 duplicates) across the cluster.
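If you want to see this overhead on a live cluster, you can compare a directory's logical size with the raw space consumed by all of its replicas. A small sketch using getContentSummary (the path is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUsage {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Example path; point this at the directory you care about.
            ContentSummary summary = fs.getContentSummary(new Path("/data"));
            long logicalBytes = summary.getLength();         // the data itself
            long rawBytes     = summary.getSpaceConsumed();  // including all replicas

            System.out.println("Logical size: " + logicalBytes + " bytes");
            System.out.println("Raw usage:    " + rawBytes + " bytes");
            System.out.println("Effective replication factor: "
                    + (double) rawBytes / logicalBytes);
            fs.close();
        }
    }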

Upvotes: 0
