Reputation: 35
What is the fastest way to copy files in HDFS programmatically? I have tried DistCp but couldn't get the appropriate content.
Upvotes: 0
Views: 4008
Reputation: 34184
distcp works perfectly fine for both localFS-to-HDFS and HDFS-to-HDFS copying. However, it doesn't give you the benefit of MapReduce's high parallelism when the input data resides on the localFS (a non-distributed store) rather than on HDFS. So, using either of the two will give you almost the same performance in that case, which obviously depends on the hardware and the size of the input data.
BTW, what do you mean by "tried DistCp but couldn't get the appropriate content"?
Upvotes: 2
Reputation: 1855
DistCp is certainly the fastest way to copy large amounts of data over HDFS. I would suggest trying it first from the command line before calling it from your favorite programming language.
hadoop distcp -p -update "hdfs://A:8020/user/foo/bar" "hdfs://B:8020/user/foo/baz"
-p preserves file status; -update overwrites a destination file if it is already present but has a different size.
Since DistCp is written in Java, you shouldn't have any difficulty calling it from a Java application. You can also use your favorite scripting language (Python, bash, etc.) to run hadoop distcp like any other command-line application.
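For instance, a minimal Python sketch that builds the same command line as above and hands it to the shell (the NameNode URIs are just the example ones from this answer, substitute your own):

```python
import subprocess

def build_distcp_cmd(src, dst):
    """Build the argv list for a distcp invocation.
    -p preserves file status, -update skips files already
    present at the destination with the same size."""
    return ["hadoop", "distcp", "-p", "-update", src, dst]

cmd = build_distcp_cmd("hdfs://A:8020/user/foo/bar",
                       "hdfs://B:8020/user/foo/baz")

# On a machine with the hadoop CLI on its PATH you would then run:
# subprocess.run(cmd, check=True)
```

Building the argv as a list (rather than one string) avoids shell-quoting problems with paths that contain spaces.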
Upvotes: 0
Reputation: 1399
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copy a local directory into HDFS using the FileSystem API.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/home/me/localdirectory/"),
                     new Path("/me/hadoop/hdfsdir"));
DistCp copies between HDFS endpoints (hdfs-to-hdfs); to copy from the local filesystem into HDFS, use copyFromLocalFile as above.
Upvotes: 0