Reputation: 8365
I ran a test AWS EMR job with a custom mapper but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I combine them into a single file?
I don't need to aggregate data in any special way, and I don't care if it is sorted, re-ordered arbitrarily, or left in order. But I would like to efficiently put the data back into a single file. Do I have to do that manually, or is there a way to do it as part of the EMR Cluster?
It's very strange to me that there isn't a default option or some sort of automatic step available for this. I've read a bit about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching a cluster through the EMR console?
My data is in S3.
EDIT
To be very clear, I can run cat on all of the output parts after the job is done, if that's what I have to do. Locally, or on an EC2 instance, or whatever. Is that really what everyone does?
Upvotes: 5
Views: 1822
Reputation: 6855
If the mapper's output part files are small, you could use hadoop fs -getmerge to merge them into a single file on the local filesystem:
hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]
And then put the merged file back to S3:
hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/
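The two commands above can be wrapped into one step. The following is only a sketch: the function name merge_parts and its arguments are hypothetical, and it assumes the hadoop CLI is on the PATH and that the merged output fits on the local disk.

```shell
# merge_parts SRC_URI DEST_URI   (hypothetical helper, not part of Hadoop)
# Merges every part file under SRC_URI into one local file, then
# uploads that file to DEST_URI.
merge_parts() {
  local src="$1" dest="$2"
  local tmpdir merged status=0

  # Work in a fresh temp directory so getmerge writes to a path that
  # does not exist yet; any local checksum file lands there too.
  tmpdir="$(mktemp -d)" || return 1
  merged="$tmpdir/merged"

  # Only upload if the merge succeeded; remember the first failure.
  { hadoop fs -getmerge "$src" "$merged" &&
    hadoop fs -put "$merged" "$dest"; } || status=$?

  # Clean up the local copy whether or not the upload worked.
  rm -rf "$tmpdir"
  return "$status"
}
```

It could be called as merge_parts s3n://BUCKET/path/to/output/ s3n://BUCKET/path/to/put/merged.txt, where the destination file name is just an example.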
For the above commands to work, you should have the following properties set in core-site.xml:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
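If editing core-site.xml is inconvenient, the same two properties can probably be passed per command through Hadoop's generic -D option, which has to come before the -getmerge subcommand. This is a sketch under assumptions: the function name s3n_getmerge is hypothetical, and reading the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is a choice made here, not something Hadoop requires. Note that values passed on a command line can be visible to other users in the process list.

```shell
# s3n_getmerge SRC_URI LOCAL_FILE   (hypothetical helper)
# Runs getmerge with the s3n credentials supplied as -D properties
# instead of reading them from core-site.xml.
s3n_getmerge() {
  local src="$1" out="$2"

  # ${VAR:?msg} aborts with the message if the variable is unset or empty.
  hadoop fs \
    -D fs.s3n.awsAccessKeyId="${AWS_ACCESS_KEY_ID:?set AWS_ACCESS_KEY_ID}" \
    -D fs.s3n.awsSecretAccessKey="${AWS_SECRET_ACCESS_KEY:?set AWS_SECRET_ACCESS_KEY}" \
    -getmerge "$src" "$out"
}
```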
Upvotes: 3