Reputation: 99
I need to run a custom C++ job as a MapReduce job on Amazon EMR, and was planning to use Hadoop streaming for this. The C++ mapper executable relies on dozens of custom libraries, some of which are time-consuming to build.
I expected EMR to support custom AMIs (I already have one built). However, after a careful look at the documentation, it seems that EMR can only run on predefined images: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html.
Am I missing something? If only predefined AMIs are indeed supported, what is the best option for getting this to run? The executable itself is on S3, but can I actually bundle it up so that it depends on no shared libraries at all?
Thanks.
Upvotes: 6
Views: 3378
Reputation: 1836
Custom AMIs are indeed a very interesting use case. One option would be to use Qubole, which offers built-in support for custom-built AMIs on which you can install all your necessary libraries, along with other Qubole features such as autoscaling and spot instance support.
Disclaimer: I work for Qubole.
Upvotes: 2
Reputation: 14915
You are correct: because of the many software tools and configurations required on a Hadoop cluster node, only Amazon-provided AMIs are allowed on EMR. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
You can use standard bootstrapping techniques to install any additional software you require to run on your cluster. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html to learn more about bootstrap actions.
Back to your use case: why does bootstrapping take so long? Is it because there are many packages, or because you are compiling them from source?
In the latter case, it might be worth building your own .deb packages and installing them from a custom repository to speed up the bootstrap process.
If it is just because you have many packages to install, I am afraid there is no obvious solution today. I can think of EBS snapshots and volumes being created and attached during bootstrap, but the feasibility of this really depends on your use case.
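To make the prebuilt .deb idea a bit more concrete, here is a minimal sketch in Python that renders the text of a bootstrap script. Everything specific in it is an assumption for illustration: the function name `render_bootstrap_script`, the bucket path, the package file names, and the working directory are all hypothetical, and it assumes that `hadoop fs -copyToLocal` can read `s3://` paths on the node and that `sudo dpkg` is usable on your AMI. Adjust it to whatever your cluster image actually provides.

```python
import shlex


def render_bootstrap_script(s3_prefix, packages, workdir="/tmp/custom-debs"):
    """Return the text of a shell script that copies prebuilt .deb files
    from S3 to the node and installs them with dpkg.

    All names here are illustrative, not part of any EMR API.
    """
    if not packages:
        raise ValueError("at least one package file name is required")

    prefix = s3_prefix.rstrip("/")
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",  # stop the bootstrap on the first failing command
        f"mkdir -p {shlex.quote(workdir)}",
    ]

    for pkg in packages:
        src = shlex.quote(f"{prefix}/{pkg}")
        dst = shlex.quote(f"{workdir}/{pkg}")
        # Assumption: the Hadoop client on the node can read s3:// paths.
        lines.append(f"hadoop fs -copyToLocal {src} {dst}")

    # Pass all packages to a single dpkg invocation.
    debs = " ".join(shlex.quote(f"{workdir}/{pkg}") for pkg in packages)
    lines.append(f"sudo dpkg -i {debs}")

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical bucket and package names, for illustration only.
    print(render_bootstrap_script(
        "s3://my-bucket/debs/",
        ["libfoo_1.0_amd64.deb", "libbar_2.1_amd64.deb"],
    ))
```

The printed script would then be uploaded to S3 and registered as a bootstrap action when the cluster is launched. Since the packages are built once ahead of time, each node only pays for the download and install, not for compilation.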
Upvotes: 4
Reputation: 171
I am also investigating the same thing. Based on a first look at the documentation, the best option seems to be custom bootstrap actions.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
However, for us the custom script will take 15-20 minutes to run. I am hoping there is a way to customize the AMI and bake the required software into it, instead of installing it on every node as it comes up.
Upvotes: 1