user3923714

Reputation: 1

Process entire files using Hadoop streaming on Amazon EMR

I have a directory full of gzipped text files on Amazon S3, and I'm trying to use Hadoop streaming on Amazon Elastic MapReduce to apply a function to each file individually (specifically, parse a multi-line header). The default Hadoop streaming "each line is a record" format does not work here.

My attempt was to set -input to a text file listing the S3 path of each gzipped file, and then use "hadoop fs -get" or "hadoop fs -copyToLocal" in the mapper to copy each file to the worker node and run functions on the whole file. However, doing this causes the step to fail with a "permission denied" error.
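For illustration, here is a rough sketch of the kind of mapper I mean, assuming a Python streaming script; parse_header and the four-line header are placeholders for my actual parsing logic:

    #!/usr/bin/env python
    import gzip
    import os
    import subprocess
    import sys
    import tempfile

    # Placeholder for the real per-file logic: read the multi-line header of
    # one gzipped file (the 4-line header is an assumption for illustration).
    def parse_header(path):
        with gzip.open(path, "rt") as f:
            return "|".join(f.readline().rstrip("\n") for _ in range(4))

    for line in sys.stdin:
        s3_path = line.strip()
        if not s3_path:
            continue
        # Copy the whole gzipped file to a writable local directory on this node.
        local_dir = tempfile.mkdtemp()
        local_path = os.path.join(local_dir, os.path.basename(s3_path))
        subprocess.check_call(["hadoop", "fs", "-get", s3_path, local_path])
        # Emit key<TAB>value so downstream streaming stages can consume it.
        print("%s\t%s" % (os.path.basename(s3_path), parse_header(local_path)))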

I'm guessing that this has something to do with the dfs.permissions.enabled variable, but I'm not having any luck passing it through the Hadoop setup bootstrap interface.

Anyone have an idea what's causing the error and how to fix it? Alternatively, if there is some other method for applying functions to entire files using EMR (or some other Amazon tool), I'm open to those as well. Thanks!

Upvotes: 0

Views: 185

Answers (1)

programmerbyheart

Reputation: 99

It could be due to limited permissions on the folder on the worker node that you are copying the file into. Please check that folder's permissions.
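One quick way to rule that out inside the mapper is to test the target directory and fall back to a temp directory, along these lines (a rough Python sketch; the directory path is just a placeholder):

    import os
    import sys
    import tempfile

    # Hypothetical local directory -- replace with the one your mapper copies into.
    target_dir = "/mnt/var/lib/hadoop/tmp"

    if not os.access(target_dir, os.W_OK):
        # Log to stderr (this shows up in the task attempt logs) and fall back
        # to a freshly created directory the task user can always write to.
        sys.stderr.write("no write access to %s, using a temp dir instead\n" % target_dir)
        target_dir = tempfile.mkdtemp()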

Also, it would help if you shared the full log.

Upvotes: 1
