Reputation: 2939
I am using S3DistCp to copy content from S3 to HDFS on Amazon EMR. Some jobs run out of space, and I expect to solve this by reducing the replication factor, but I do not see a way to set it at the job level. Can someone help with this?
Upvotes: 2
Views: 945
Reputation: 270029
You would not normally want to modify a cluster's replication factor on a job-by-job basis. Replication is used for data redundancy (in case of failure) and to improve performance (by having data closer to the compute operations). It's best to leave the cluster at a pre-defined value.
By default, Amazon EMR sets the replication factor to 1 for clusters with 1-3 core nodes, 2 for 4-9 core nodes, and 3 for 10 or more core nodes.
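To make those thresholds concrete, here is a minimal sketch that maps a core node count to the default described above. The function name `default_replication_factor` is made up for illustration and is not part of any EMR API; it only encodes the three ranges stated in this answer.

```python
def default_replication_factor(core_nodes: int) -> int:
    """Return the default HDFS replication factor described in this answer.

    Hypothetical helper: it encodes the ranges quoted above (1-3, 4-9, 10+)
    and does not query a real cluster.
    """
    if core_nodes < 1:
        raise ValueError("a cluster needs at least one core node")
    if core_nodes <= 3:
        return 1
    if core_nodes <= 9:
        return 2
    return 3


if __name__ == "__main__":
    # Show the value at each boundary of the three ranges.
    for n in (1, 3, 4, 9, 10, 25):
        print(f"{n} core nodes: replication factor {default_replication_factor(n)}")
```

Because HDFS stores one copy of every block per replica, a cluster that crosses from 9 to 10 core nodes goes from two copies to three, which is worth keeping in mind when estimating how much space a copy job needs.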
You could theoretically change the dfs.replication setting, but it's probably not the best way to solve your current problem.
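If you do decide to change it, a cluster-wide override would look roughly like the sketch below, which builds a configuration object using the `hdfs-site` classification. The helper name `hdfs_replication_config` is hypothetical, and the sketch assumes you supply the resulting JSON as the cluster's configurations when you create it, so it applies to the whole cluster rather than to a single job.

```python
import json


def hdfs_replication_config(factor: int) -> list:
    """Build an EMR-style configuration list that sets dfs.replication.

    Hypothetical helper for illustration; assumes the list is passed as the
    cluster configurations at creation time.
    """
    if factor < 1:
        raise ValueError("replication factor must be at least 1")
    return [
        {
            "Classification": "hdfs-site",
            # Property values are written as strings in the configuration JSON.
            "Properties": {"dfs.replication": str(factor)},
        }
    ]


if __name__ == "__main__":
    # Print the JSON for a cluster-wide replication factor of 2.
    print(json.dumps(hdfs_replication_config(2), indent=2))
```

Lowering the value frees space at the cost of the redundancy described above, so adding core nodes or storage is usually the safer fix.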
Upvotes: 3