Reputation: 7951
We are about to start the ingestion phase of our data lake project, and I have mostly used hadoop fs -put
throughout my time as a Hadoop developer. How does hadoop distcp differ from it,
and when should each one be used?
Upvotes: 7
Views: 5338
Reputation: 11
DistCp is the command typically used for copying data from one cluster's HDFS location to another cluster's HDFS location. It creates a MapReduce job with 0 reducers to copy the data.
hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/
scp is the command used for copying data from one machine's local file system to another machine's local file system.
scp source-ip:/directory/filename target-ip:/directory/
hdfs put command: copies data from the local file system to HDFS. It does not create a MapReduce job to copy the data.
hadoop fs -put -f /path/file /hdfspath/file
hdfs get command: copies data from HDFS to the local file system.
First, go to the local directory where you want the file to be copied, then run the command below:
hadoop fs -get /hdfsloc/file
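To summarise the commands above, here is a rough sketch of the choice this answer describes. The function names and the namenode host are made up for illustration, and the script only prints the command it would suggest, so it runs without a Hadoop installation. It assumes that anything starting with hdfs:// or webhdfs:// is a cluster path and everything else is local.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the command this answer suggests.
# Assumption: hdfs:// and webhdfs:// URIs are cluster paths, anything else is local.
is_cluster_path() {
  case "$1" in
    hdfs://*|webhdfs://*) return 0 ;;
    *) return 1 ;;
  esac
}

suggest_copy_command() {
  local src="$1" dst="$2"
  if is_cluster_path "$src" && is_cluster_path "$dst"; then
    echo "hadoop distcp $src $dst"        # cluster to cluster
  elif is_cluster_path "$src"; then
    echo "hadoop fs -get $src $dst"       # cluster to local
  elif is_cluster_path "$dst"; then
    echo "hadoop fs -put -f $src $dst"    # local to cluster
  else
    echo "scp $src $dst"                  # local to local
  fi
}

# "namenode" is a placeholder host name.
suggest_copy_command /path/file hdfs://namenode/hdfspath/file
```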
Upvotes: 1
Reputation: 100
"distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem" -> It can: use "file" as the scheme in the URL (e.g. "file:///tmp/test.txt"). See https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
Hint: use "hadoop distcp -D dfs.replication=1" to shorten the copy by writing only one replica of each file, then raise the replication factor of the copied files afterwards.
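A minimal sketch of both points together, assuming a hypothetical destination (hdfs://namenode:8020/data/landing/) and a target replication factor of 3. It defaults to a dry run that only prints the commands. One caveat, as far as I know: the DistCp map tasks run on cluster nodes, so a file:// source has to be readable at the same path on whichever node runs the task (a shared mount, for example).

```shell
#!/usr/bin/env bash
# Sketch only: SRC is the example from this answer, DST and
# TARGET_REPLICATION are placeholder values chosen for illustration.
SRC="file:///tmp/test.txt"
DST="hdfs://namenode:8020/data/landing/"
TARGET_REPLICATION=3

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

copy_then_replicate() {
  # Step 1: copy while writing a single replica of each file.
  run hadoop distcp -D dfs.replication=1 "$SRC" "$DST"
  # Step 2: raise the replication factor afterwards (-w waits until done).
  run hadoop fs -setrep -w "$TARGET_REPLICATION" "$DST"
}

copy_then_replicate
```

Run it with DRY_RUN=0 on a machine with a configured Hadoop client to execute the two commands for real.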
Upvotes: 0
Reputation: 83
hdfs dfs -put (or hadoop fs -put) is used for data ingestion from the local file system to HDFS.
distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem
We use distcp extensively for archiving, backup, and restore of HDFS files, with something like this:
hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH
Upvotes: 1
Reputation: 8937
DistCp is a special tool used for copying data from one cluster to another. You usually copy from one HDFS to another HDFS, not from the local file system. Another very important point is that the copy is done as a MapReduce job with 0 reduce tasks, which makes it faster because the work is distributed. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
hdfs put copies data from the local file system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially by talking to the NameNode and DataNodes. It does not create a MapReduce job to copy the data.
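To make the "each map task copies a partition of the file list" idea concrete, here is a conceptual illustration. It is not DistCp's actual code: it simply deals a list of made-up file paths round-robin to a given number of hypothetical map tasks, whereas the real tool may balance the partitions differently (by size, for instance).

```shell
#!/usr/bin/env bash
# Conceptual illustration only, not how DistCp is implemented.
# Deals the given files round-robin to num_maps hypothetical map tasks.
partition_files() {
  local num_maps="$1"
  shift
  local i=0 f
  for f in "$@"; do
    echo "map-$(( i % num_maps )) copies $f"
    i=$(( i + 1 ))
  done
}

# The paths are made up for the example.
partition_files 2 /data/a /data/b /data/c
# map-0 copies /data/a
# map-1 copies /data/b
# map-0 copies /data/c
```

With put, a single client would copy all three files one after another. With distcp, the partitions are copied in parallel, and the -m option caps how many map tasks are used.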
Upvotes: 9