toy

Reputation: 12141

How do I copy a file from S3 to Amazon EMR in Data Pipeline after EMR is provisioned?

I'm creating a Data Pipeline in AWS to run a Pig task, but my Pig task requires an additional file on the EMR cluster. How do I tell Data Pipeline to copy a file to EMR after the cluster is created and before the Pig task is run?

I just need to run these two commands:

hdfs dfs -mkdir /somefolder
hdfs dfs -put somefile_from_s3 /somefolder/

Upvotes: 0

Views: 965

Answers (3)

Durga Viswanath Gadiraju

Reputation: 3956

You can add a step while creating the cluster, as demonstrated here.

You can configure a Hive or Pig script to perform the copy, or you can add a step from the command line as well.
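For example, a rough sketch of adding a copy step from the AWS CLI once the cluster is up (the cluster id, bucket, and folder names below are placeholders, and command-runner.jar assumes an EMR 4.x or later release):

# placeholder cluster id and bucket -- substitute your own
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=CopyFromS3,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3://your-bucket/input/,--dest,hdfs:///somefolder/]'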

Upvotes: 1

Steve McPherson

Reputation: 1

Also, have you tried using an 's3://' path directly? In most cases you can use S3 as a native Hadoop file system via the 's3://' scheme, which avoids the need to copy the data from S3 to HDFS at all.

--
-- import logs and break into tuples
--
raw_logs =  -- load the weblogs into a sequence of one element tuples
  LOAD 's3://elasticmapreduce/samples/pig-apache/input' USING TextLoader AS (line:chararray);

Upvotes: 0

Austin Lee

Reputation: 84

If you have the option of modifying your Pig script, you can run the mkdir and put commands at the top of your script (https://pig.apache.org/docs/r0.9.1/cmds.html).
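If you go that route, a minimal sketch of what the top of the script could look like (the bucket and folder names are placeholders; it assumes the 's3://' filesystem is available on the cluster, as it is on EMR):

-- create the HDFS folder and copy the file in before any LOAD statements
-- (placeholder bucket/paths -- substitute your own)
fs -mkdir /somefolder;
fs -cp s3://your-bucket/somefile /somefolder/;

-- the rest of the script can then read from HDFS as usual
raw = LOAD '/somefolder/somefile' USING TextLoader AS (line:chararray);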

Otherwise, you can use a ShellCommandActivity that runs on the EmrCluster and executes those commands before your PigActivity runs. The drawback of this option is that if the ShellCommandActivity succeeds but the PigActivity fails, simply rerunning the PigActivity won't recreate the files it needs, so the whole pipeline would have to be rerun. For that reason, I would recommend the first solution.
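For reference, a rough sketch of how those two objects could be wired up in the pipeline definition (the object ids, bucket, and script location are placeholders, not taken from your pipeline):

{
  "id": "StageFileToHdfs",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEmrCluster" },
  "command": "hdfs dfs -mkdir /somefolder && hadoop fs -cp s3://your-bucket/somefile /somefolder/"
},
{
  "id": "MyPigActivity",
  "type": "PigActivity",
  "runsOn": { "ref": "MyEmrCluster" },
  "dependsOn": { "ref": "StageFileToHdfs" },
  "scriptUri": "s3://your-bucket/scripts/your-script.pig"
}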

I would be happy to produce a working sample for you either way. Please let me know which solution you would like to see.

Thanks.

Upvotes: 1
