scalauser

Reputation: 1327

Copy files from S3 to HDFS using distcp or s3distcp

I am trying to copy files from S3 to HDFS using the following command:

hadoop distcp s3n://bucketname/filename hdfs://namenodeip/directory

However, this is not working; I get the following error:

ERROR tools.DistCp: Exception encountered 
java.lang.IllegalArgumentException: Invalid hostname in URI

I have also tried adding the S3 keys to the Hadoop configuration file, but that did not work either. Please describe the appropriate step-by-step procedure to copy files from S3 to HDFS.
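For reference, what I added to the configuration looked roughly like this (I am assuming fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the property names the s3n scheme expects):

<!-- in core-site.xml: credentials for the s3n file system -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>MY_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>MY_SECRET_KEY</value>
</property>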

Thanks in advance.

Upvotes: 2

Views: 27527

Answers (2)

scalauser

Reputation: 1327

The command should be like this:

hadoop distcp s3n://bucketname/directoryname/test.csv /user/myuser/mydirectory/

This copies the test.csv file from S3 into the HDFS directory /user/myuser/mydirectory/. Here the S3 file system is used in native mode (the s3n scheme). More details can be found at http://wiki.apache.org/hadoop/AmazonS3
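If the AWS keys are not already set in core-site.xml, they can also be passed on the command line through Hadoop's generic -D options (the property names below assume the s3n scheme), and the copy can then be verified with a listing, roughly like this:

hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=MY_ACCESS_KEY \
  -Dfs.s3n.awsSecretAccessKey=MY_SECRET_KEY \
  s3n://bucketname/directoryname/test.csv /user/myuser/mydirectory/

hadoop fs -ls /user/myuser/mydirectory/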

Upvotes: 7

Sathish

Reputation: 3557

This example copies log files stored in an Amazon S3 bucket into HDFS. Here the --srcPattern option is used to limit the copy to the daemon logs.

Linux, UNIX, and Mac OS X users:

./elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'

Windows users:

ruby elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'

Please check this link for more :
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
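If you are already logged in to the cluster's master node, I believe the same jar can be run directly as a Hadoop job, without the elastic-mapreduce client, along these lines:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/ \
  --dest hdfs:///output \
  --srcPattern '.*daemons.*-hadoop-.*'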

Hope this helps!

Upvotes: 1
