Water

Reputation: 147

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.

I have uploaded the data file to S3.

How can I access that data while running my job in EMR?
If I just specify the file path as:

s3n://<bucket-name>/path

in the code, will that work?

Thanks

Upvotes: 0

Views: 492

Answers (2)

Water

Reputation: 147

What I ended up doing:

1) Wrote a small script that copies my file from S3 to the cluster:

hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt  $DESTINATION_DIR_ON_HOST

2) Created a bootstrap step for my EMR job that runs the script from 1).

This approach doesn't require making the S3 data public.
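The steps above can be combined into one small bootstrap script. This is only a sketch; the bucket name, file path, and destination directory are placeholders you'd replace with your own:

```shell
#!/bin/bash
# copy-static-data.sh - EMR bootstrap action (runs on every node).
# Copies a static lookup file from S3 onto the local filesystem so
# mappers can open it with ordinary file I/O.
set -e

SOURCE_S3_BUCKET="my-bucket"          # placeholder bucket name
DESTINATION_DIR_ON_HOST="/mnt/data"   # placeholder local directory

mkdir -p "$DESTINATION_DIR_ON_HOST"
hadoop fs -copyToLocal \
  "s3n://$SOURCE_S3_BUCKET/path/file.txt" \
  "$DESTINATION_DIR_ON_HOST"
```

Upload the script itself to S3 and register it as a bootstrap action when creating the cluster, so it runs before any job steps start.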

Upvotes: 0

user1452132

Reputation: 1768

The s3n:// URL scheme is for Hadoop itself to read files from S3. If you want to read the S3 file inside your map program, you either need a library that handles the s3:// URL format, such as JetS3t - https://jets3t.s3.amazonaws.com/toolkit/toolkit.html - or you can access the S3 object via HTTP.

A quick search for an example program turned up this link: https://gist.github.com/lucastex/917988

You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then fetch it using the HTTP URL support built into Java.
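For a quick check that the HTTP route works, a public object can be fetched with a plain HTTP client from the command line; the same URL form is what you would open from Java. Bucket and key names here are placeholders:

```shell
# Download a public S3 object over HTTPS (bucket/key are placeholders).
# The virtual-hosted URL form is https://<bucket>.s3.amazonaws.com/<key>.
curl -o file.txt "https://my-bucket.s3.amazonaws.com/path/file.txt"
```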

Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
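As a sketch, an S3DistCp invocation on the cluster might look like the following; the source and destination paths are placeholders, and the exact way the tool is launched (standalone command vs. a jar step) depends on your EMR version:

```shell
# Copy data from S3 into HDFS with S3DistCp so the MapReduce job
# reads it locally from HDFS (src/dest paths are placeholders).
s3-dist-cp --src "s3n://my-bucket/path/" --dest "hdfs:///data/"
```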

Upvotes: 1
