oikonomiyaki

Reputation: 7951

Difference between hadoop fs -put and hadoop distcp

We are about to start the ingestion phase of our data lake project, and I have mostly used hadoop fs -put throughout my experience as a Hadoop developer. How does hadoop distcp differ from it, and how does the usage differ?

Upvotes: 7

Views: 5338

Answers (4)

Sahil Agnihotri

Reputation: 11

DistCp is a command used for copying data from one cluster's HDFS location to another cluster's HDFS location (it also works between paths within a single cluster). It creates a MapReduce job with 0 reducers to perform the copy.

hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/

scp is the command used for copying data from one host's local file system to another host's local file system.

scp source-ip:/directory/filename target-ip:/directory/

The hdfs put command copies data from the local file system to HDFS. It does not create MapReduce jobs for processing the data.

hadoop fs -put -f /path/file /hdfspath/file

The hdfs get command copies data from HDFS to the local file system.

First, go to the directory where you want to copy the file, then run the command below:

hadoop fs -get /hdfsloc/file

Upvotes: 1

matz3

Reputation: 100

"distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem" -> it can: use the "file" scheme in the URL (e.g. "file:///tmp/test.txt"). See https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
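A minimal sketch of such a local-to-HDFS copy. The NameNode address and target directory are hypothetical placeholders, and the script only prints the command unless DRY_RUN=0 is set. One caveat to keep in mind: DistCp's copies run in map tasks on cluster nodes, so a file:// source generally has to be readable at the same path on every node that may run a task (for example via a shared mount).

```shell
#!/bin/sh
# Hypothetical placeholders; replace with your own source and target.
SRC="file:///tmp/test.txt"
DST="hdfs://namenode:8020/data/landing/"

# The file:// scheme marks the source as a local-filesystem path.
CMD="hadoop distcp $SRC $DST"

# Print by default; set DRY_RUN=0 to actually run it on a Hadoop client.
if [ "${DRY_RUN:-1}" = "0" ]; then
  $CMD
else
  echo "$CMD"
fi
```

For a single small file on the machine you are logged in to, hadoop fs -put is usually the simpler choice.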

Hint: use "hadoop distcp -D dfs.replication=1" to reduce the copy time, then raise the replication factor of the copied files afterwards.
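The two steps of that hint could look roughly like this. It is a sketch under assumptions: the cluster addresses, the path, and the final replication factor of 3 are all made up for illustration, and the script only prints the commands unless DRY_RUN=0 is set. The second step uses hadoop fs -setrep, where -w waits until the new replication factor has been reached.

```shell
#!/bin/sh
# Hypothetical source and target; adjust to your clusters.
SRC="hdfs://source-nn:8020/data/in"
DST="hdfs://target-nn:8020/data/in"
FINAL_REPLICATION=3   # assumed final replication factor

# Step 1: copy with a single replica to shorten the copy itself.
COPY_CMD="hadoop distcp -D dfs.replication=1 $SRC $DST"

# Step 2: raise replication afterwards; on a directory, setrep applies
# to all files under it.
SETREP_CMD="hadoop fs -setrep -w $FINAL_REPLICATION $DST"

if [ "${DRY_RUN:-1}" = "0" ]; then
  $COPY_CMD && $SETREP_CMD
else
  echo "$COPY_CMD"
  echo "$SETREP_CMD"
fi
```

Until step 2 finishes, the copied files have only one replica, so they are less protected against a node failure during that window.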

Upvotes: 0

Yogesh_JavaJ2EE

Reputation: 83

hdfs dfs -put or hadoop fs -put is used for data ingestion from the local file system to HDFS

distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem

We use distcp extensively for archiving (backup and restore) of HDFS files, something like this:

hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH
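For repeated backups, one possible variation is DistCp's -update option, which copies only files that are missing from the target or differ from it. This is a sketch, not the setup described above: the default paths are hypothetical, and the script only prints the command unless DRY_RUN=0 is set. Note that with -update the contents of the source directory are copied into the target directory, rather than the source directory itself.

```shell
#!/bin/sh
# Fall back to hypothetical paths when the variables are not set.
CURRENT_HDFS_PATH="${CURRENT_HDFS_PATH:-/data/current}"
BACKUP_HDFS_PATH="${BACKUP_HDFS_PATH:-/backup/current}"

# -update skips files that already exist unchanged at the target.
BACKUP_CMD="hadoop distcp -update $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH"

if [ "${DRY_RUN:-1}" = "0" ]; then
  $BACKUP_CMD
else
  echo "$BACKUP_CMD"
fi
```

A restore is the same command with the two paths swapped.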

Upvotes: 1

Alex

Reputation: 8937

DistCp is a special tool used for copying data from one cluster to another. You usually copy from one HDFS to another HDFS, not from the local file system. Another very important point is that the copy is done as a MapReduce job with 0 reduce tasks, which makes it faster because the work is distributed. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
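Since each map task copies its share of the file list, the number of maps bounds how parallel the copy is. A small sketch, using DistCp's -m option (maximum number of simultaneous copies); the cluster addresses and the value 20 are hypothetical, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical clusters and paths.
SRC="hdfs://source-nn:8020/warehouse/events"
DST="hdfs://target-nn:8020/warehouse/events"
MAX_MAPS=20   # assumed upper bound on concurrent copy tasks

# -m caps the number of map tasks, and so the number of parallel copies.
DISTCP_CMD="hadoop distcp -m $MAX_MAPS $SRC $DST"

if [ "${DRY_RUN:-1}" = "0" ]; then
  $DISTCP_CMD
else
  echo "$DISTCP_CMD"
fi
```

More maps only help when there are enough files to spread across them; a single file is still copied by one task.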

hdfs put copies data from the local system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially, contacting the NameNode and DataNodes. It does not create MapReduce jobs for processing the data.

Upvotes: 9
