Reputation: 170
We are in the process of automating the launch of on demand EMR clusters. This will be triggered upon the arrival of certain files in AWS S3. In this regard, we are evaluating two options - 1. Shell script that will invoke a AWS CLI to launch the desired EMR cluster 2. Python script that will invoke methods for EMR start, stop using the boto3
Is there any preference of using one option over the other? The former appears easier, as we can take the CLI from the manually created EMRs from the AWS console and package it into a shell script. While the later option has intricacies and doesn't have such a starting point and the methods would have to be written from scratch.
Appreciate your inputs in this regard.
Upvotes: 0
Views: 1341
Reputation: 621
While both can achieve what you want, I would suggest to go with Lambda (Python).
Create an event trigger on the S3 location where data is expected - this will invoke your lambda (python code) and lambda can in-turn launch your EMR.
s3-> lambda -> EMR
Another option could be to trigger a data pipeline from lambda which will create the EMR for you.
s3 -> lambda -> pipeline -> EMR
Advantages of using pipeline vs lambda to create EMR
You can read more about it here - https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
Upvotes: 1