Water

Reputation: 147

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.

I have uploaded the data file to S3.

How can I access that data while running my job in EMR?
If I just specify the file path as:

s3n://<bucket-name>/path

in the code, will that work?

Thanks

Upvotes: 0

Views: 492

Answers (2)

Water

Reputation: 147

What I ended up doing:

1) Wrote a small script that copies my file from S3 to the cluster:

hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt  $DESTINATION_DIR_ON_HOST

2) Created a bootstrap step for my EMR job that runs the script from 1).

This approach doesn't require making the S3 data public.
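The steps above can be combined into one small bootstrap script. This is only a sketch; the bucket name, file path, and destination directory are placeholders you'd replace with your own:

```shell
#!/bin/bash
# copy-static-data.sh - EMR bootstrap action (runs on every node).
# Copies a static lookup file from S3 onto the local filesystem so
# mappers can open it with ordinary file I/O.
set -e

SOURCE_S3_BUCKET="my-bucket"          # placeholder bucket name
DESTINATION_DIR_ON_HOST="/mnt/data"   # placeholder local directory

mkdir -p "$DESTINATION_DIR_ON_HOST"
hadoop fs -copyToLocal \
  "s3n://$SOURCE_S3_BUCKET/path/file.txt" \
  "$DESTINATION_DIR_ON_HOST"
```

Upload the script itself to S3 and register it as a bootstrap action when creating the cluster, so it runs before any job steps start.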

Upvotes: 0

user1452132

Reputation: 1768

The s3n:// URL scheme is for Hadoop itself to read files from S3. If you want to read the S3 file inside your map program, you either need a library that handles the s3:// URL format, such as JetS3t - https://jets3t.s3.amazonaws.com/toolkit/toolkit.html - or you can access the S3 object via HTTP.

A quick search for an example program turned up this link: https://gist.github.com/lucastex/917988

You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then fetch it using the HTTP URL support built into Java.
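For a quick check that the HTTP route works, a public object can be fetched with a plain HTTP client from the command line; the same URL form is what you would open from Java. Bucket and key names here are placeholders:

```shell
# Download a public S3 object over HTTPS (bucket/key are placeholders).
# The virtual-hosted URL form is https://<bucket>.s3.amazonaws.com/<key>.
curl -o file.txt "https://my-bucket.s3.amazonaws.com/path/file.txt"
```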

Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
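As a sketch, an S3DistCp invocation on the cluster might look like the following; the source and destination paths are placeholders, and the exact way the tool is launched (standalone command vs. a jar step) depends on your EMR version:

```shell
# Copy data from S3 into HDFS with S3DistCp so the MapReduce job
# reads it locally from HDFS (src/dest paths are placeholders).
s3-dist-cp --src "s3n://my-bucket/path/" --dest "hdfs:///data/"
```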

Upvotes: 1
