kamoor

Reputation: 2939

Set HDFS replication factor while running S3DistCp

I am using S3DistCp to copy content from S3 to HDFS on Amazon EMR. Some jobs run out of space, and I expect to solve this by reducing the replication factor. However, I do not see a way to set this at the job level. Can someone help with this?

Upvotes: 2

Views: 945

Answers (1)

John Rotenstein

Reputation: 270029

You would not normally want to modify a cluster's replication factor on a job-by-job basis. Replication is used for data redundancy (in case of failure) and to improve performance (by having data closer to the compute operations). It's best to leave the cluster at a pre-defined value.

By default, Amazon EMR sets the replication factor to 1 for clusters with 1-3 core nodes, 2 for clusters with 4-9 core nodes, and 3 for clusters with 10 or more core nodes.
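To see how those defaults affect disk usage, here is a minimal Python sketch. The function names `default_replication` and `raw_hdfs_bytes` are made up for illustration and are not part of any EMR or Hadoop API; the thresholds are simply the ones listed above, and the space estimate ignores block overhead and non-HDFS disk usage.

```python
def default_replication(core_nodes: int) -> int:
    """Return the default replication factor described above for a
    given number of core nodes (hypothetical helper, not an EMR API)."""
    if core_nodes < 1:
        raise ValueError("a cluster needs at least one core node")
    if core_nodes <= 3:
        return 1
    if core_nodes <= 9:
        return 2
    return 3


def raw_hdfs_bytes(logical_bytes: int, replication: int) -> int:
    """Rough estimate of raw HDFS space consumed: every block is stored
    `replication` times. Ignores metadata and other overhead."""
    return logical_bytes * replication


if __name__ == "__main__":
    one_tib = 1024 ** 4
    for nodes in (2, 5, 12):
        factor = default_replication(nodes)
        used = raw_hdfs_bytes(one_tib, factor)
        print(f"{nodes} core nodes: factor {factor}, {used / one_tib:.0f} TiB raw for 1 TiB copied")
```

So copying the same data onto a 10+ node cluster takes roughly three times the raw space it would take on a 1-3 node cluster, which may be why only some jobs run out of space.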

You could theoretically change the dfs.replication setting, but it's probably not the best way to solve your current problem.
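If you do decide to change it, one way is at the cluster level through an EMR configuration classification for hdfs-site, supplied when the cluster is launched. The Python sketch below only builds that JSON; the helper name `hdfs_replication_config` is hypothetical, and the value 2 is just an example. This is a cluster-wide setting, not a per-job one.

```python
import json


def hdfs_replication_config(factor: int) -> list:
    """Build an EMR configuration list that sets dfs.replication in
    hdfs-site (hypothetical helper; the factor is an assumed example)."""
    if factor < 1:
        raise ValueError("replication factor must be at least 1")
    return [
        {
            "Classification": "hdfs-site",
            # EMR expects property values as strings.
            "Properties": {"dfs.replication": str(factor)},
        }
    ]


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and passed as the
    # cluster's configurations at launch.
    print(json.dumps(hdfs_replication_config(2), indent=2))
```

Keep in mind that a lower factor trades redundancy for space, as described above.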

Upvotes: 3
