Harvesting Data

Reputation: 170

Automation of on-demand AWS EMR cluster - Using Python (boto3) over AWS CLI

We are automating the launch of on-demand EMR clusters, triggered by the arrival of certain files in AWS S3. We are evaluating two options: 1. A shell script that invokes the AWS CLI to launch the desired EMR cluster. 2. A Python script that uses boto3 methods to start and stop the EMR cluster.

Is there a reason to prefer one option over the other? The former appears easier, since we can take the CLI command from EMR clusters created manually in the AWS console and package it into a shell script. The latter is more intricate: there is no such starting point, and the methods would have to be written from scratch.

I would appreciate your input on this.

Upvotes: 0

Views: 1341

Answers (1)

abiydv

Reputation: 621

While both can achieve what you want, I would suggest going with Lambda (Python).

Create an event trigger on the S3 location where the data is expected. It will invoke your Lambda (Python code), and the Lambda can in turn launch your EMR cluster.

s3 -> lambda -> EMR
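A minimal sketch of what that Lambda could look like, assuming the standard S3 event notification layout and boto3's EMR `run_job_flow` call. The cluster name, release label, instance types, step arguments and the `s3://my-code-bucket/job.py` script path are all placeholders I made up for illustration; the two role names are the EMR defaults and may not exist in your account. You can map the options from the CLI command the console gives you onto these keyword arguments.

```python
import os
from urllib.parse import unquote_plus


def build_cluster_request(bucket, key):
    """Build keyword arguments for EMR run_job_flow (all values are illustrative)."""
    return {
        "Name": f"on-demand-{os.path.basename(key)}",
        "ReleaseLabel": "emr-5.36.0",  # assumption: pick the release you actually use
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Terminate the cluster once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-new-file",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Hypothetical job script; the new file is passed as its argument.
                    "Args": [
                        "spark-submit",
                        "s3://my-code-bucket/job.py",
                        f"s3://{bucket}/{key}",
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names, adjust as needed
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def lambda_handler(event, context, emr_client=None):
    """Launch an EMR cluster for the S3 object that triggered this invocation."""
    if emr_client is None:
        import boto3  # available by default in the Lambda Python runtime

        emr_client = boto3.client("emr")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])

    response = emr_client.run_job_flow(**build_cluster_request(bucket, key))
    return {"cluster_id": response["JobFlowId"]}
```

The optional `emr_client` argument is only there so the handler can be exercised locally with a stub; Lambda itself calls it with just `event` and `context`.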

Another option is to trigger an AWS Data Pipeline from the Lambda; the pipeline then creates the EMR cluster for you.

s3 -> lambda -> pipeline -> EMR
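For the pipeline route the Lambda gets much smaller. A rough sketch, assuming the pipeline has already been defined (with its EMR resource) and accepts a parameter for the input location. The pipeline ID `df-EXAMPLE` and the parameter name `myInputPath` are hypothetical; substitute whatever your pipeline definition uses.

```python
from urllib.parse import unquote_plus

# Hypothetical ID of a pipeline that was defined beforehand.
PIPELINE_ID = "df-EXAMPLE"


def lambda_handler(event, context, pipeline_client=None):
    """Activate an existing Data Pipeline for the S3 object that arrived."""
    if pipeline_client is None:
        import boto3  # available by default in the Lambda Python runtime

        pipeline_client = boto3.client("datapipeline")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    input_path = f"s3://{bucket}/{key}"

    # The pipeline creates and tears down the EMR cluster itself;
    # "myInputPath" is an assumed parameter name in the pipeline definition.
    pipeline_client.activate_pipeline(
        pipelineId=PIPELINE_ID,
        parameterValues=[{"id": "myInputPath", "stringValue": input_path}],
    )
    return {"pipeline_id": PIPELINE_ID, "input_path": input_path}
```

As before, `pipeline_client` is an optional hook for local testing with a stub and is not passed by Lambda.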

Advantages of using a pipeline rather than Lambda code to create the EMR cluster:

  • GUI based: You can pick and choose the components you need, such as resources, activities, schedules etc.
  • Minimal Python: In the Lambda you only trigger the pipeline. You don't need to implement error handling, retries, success or failure emails etc.; all of this is built into pipelines.
  • Flexible: Since pipeline components are modular and configurable, you can change any configuration quickly. Code changes often take more time.

You can read more about it here - https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html

Upvotes: 1
