user3923714

Reputation: 1

Process entire files using Hadoop streaming on Amazon EMR

I have a directory full of gzipped text files on Amazon S3, and I'm trying to use Hadoop streaming on Amazon Elastic MapReduce to apply a function to each file individually (specifically, parse a multi-line header). The default Hadoop streaming "each line is a record" format does not work here.

My attempt was to set -input to a text file listing the S3 path of each gzipped file, and then use "hadoop fs -get" or "hadoop fs -copyToLocal" in the mapper to copy each file to the worker node and run functions on the whole file. However, doing this causes the step to fail with a "permission denied" error.
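For illustration, here is a rough sketch of the kind of mapper I mean, assuming a Python streaming script; parse_header and the four-line header are placeholders for my actual parsing logic:

    #!/usr/bin/env python
    import gzip
    import os
    import subprocess
    import sys
    import tempfile

    # Placeholder for the real per-file logic: read the multi-line header of
    # one gzipped file (the 4-line header is an assumption for illustration).
    def parse_header(path):
        with gzip.open(path, "rt") as f:
            return "|".join(f.readline().rstrip("\n") for _ in range(4))

    for line in sys.stdin:
        s3_path = line.strip()
        if not s3_path:
            continue
        # Copy the whole gzipped file to a writable local directory on this node.
        local_dir = tempfile.mkdtemp()
        local_path = os.path.join(local_dir, os.path.basename(s3_path))
        subprocess.check_call(["hadoop", "fs", "-get", s3_path, local_path])
        # Emit key<TAB>value so downstream streaming stages can consume it.
        print("%s\t%s" % (os.path.basename(s3_path), parse_header(local_path)))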

I'm guessing that this has something to do with the dfs.permissions.enabled variable, but I'm not having any luck passing it through the Hadoop setup bootstrap interface.

Anyone have an idea what's causing the error and how to fix it? Alternatively, if there is some other method for applying functions to entire files using EMR (or some other Amazon tool), I'm open to those as well. Thanks!

Upvotes: 0

Views: 185

Answers (1)

programmerbyheart

Reputation: 99

It could be due to limited permissions on the folder on the worker node that you are copying the file into. Please check that folder's permissions.
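One quick way to rule that out inside the mapper is to test the target directory and fall back to a temp directory, along these lines (a rough Python sketch; the directory path is just a placeholder):

    import os
    import sys
    import tempfile

    # Hypothetical local directory -- replace with the one your mapper copies into.
    target_dir = "/mnt/var/lib/hadoop/tmp"

    if not os.access(target_dir, os.W_OK):
        # Log to stderr (this shows up in the task attempt logs) and fall back
        # to a freshly created directory the task user can always write to.
        sys.stderr.write("no write access to %s, using a temp dir instead\n" % target_dir)
        target_dir = tempfile.mkdtemp()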

Also, it would help if you shared the full log.

Upvotes: 1
