Combining AWS EMR output

Question

I ran a test AWS EMR job with a custom mapper but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I combine them into a single file?

I don't need to aggregate data in any special way, and I don't care if it is sorted, re-ordered arbitrarily, or left in order. But I would like to efficiently put the data back into a single file. Do I have to do that manually, or is there a way to do it as part of the EMR Cluster?

It's very strange to me that there isn't a default option or some sort of automatic step available for this. I've read a bit about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching a cluster through the EMR console?

My data is in S3.

EDIT

To be very clear, I can run cat on all of the output parts after the job is done, if that's what I have to do. Locally, or on an EC2 instance, or whatever. Is that really what everyone does?

Ashrith · Accepted Answer

If the mapper's output part files are small, you can use hadoop fs -getmerge to merge them into a single file on the local filesystem:

hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]

And then put the merged file back to S3:

hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/

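For the above commands to work, you should have the following properties set in core-site.xml:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>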
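As a rough sketch, the getmerge and put commands from this answer could be wrapped in one shell function so the merge runs as a single step. The function name merge_emr_output and the scratch-directory handling are illustrative assumptions, not part of Hadoop or EMR, and the sketch assumes the merged output fits on the local disk and that the destination is given as a full file path rather than a directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): merge all part files under an
# S3 output directory into one local file, then upload that file to S3.
# Usage: merge_emr_output SRC_DIR DEST_FILE
merge_emr_output() {
    src="$1"     # e.g. s3n://BUCKET/path/to/output/
    dest="$2"    # e.g. s3n://BUCKET/path/to/put/merged.txt (assumed: a file path)

    # Scratch directory for the merged file; removed before returning.
    workdir="$(mktemp -d)" || return 1

    # Concatenate every part file into one local file, then upload it.
    # The upload only runs if the merge succeeded.
    hadoop fs -getmerge "$src" "$workdir/merged" &&
        hadoop fs -put "$workdir/merged" "$dest"
    status=$?

    rm -rf "$workdir"
    return "$status"
}
```

It would be called as merge_emr_output s3n://BUCKET/path/to/output/ s3n://BUCKET/path/to/put/merged.txt, on a machine where the hadoop command and the S3 credentials are already configured.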
