Reputation: 8374
I want to execute an on-demand ETL job using AWS architecture.
This ETL process is going to run daily, and I don't want to pay for an EC2 instance all the time. The ETL job could be written in Python, for example.
I know that in EMR I can build my cluster on-demand and execute a Hadoop job.
What is the best architecture to run a simple on-demand ETL job?
Upvotes: 0
Views: 746
Reputation: 53
For an on-demand ETL job you can use AWS Lambda: the Lambda function contains the code that starts your ETL job. AWS Lambda can be triggered by other AWS services, such as S3, CloudWatch (which can fire on a schedule), SNS, etc.
If you write the Lambda function in Python, you can use the boto3 SDK (http://boto3.readthedocs.io/en/latest/) to access other AWS services, including AWS Glue.
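For illustration, here is a minimal sketch of such a Lambda function, assuming the ETL is defined as an AWS Glue job; the job name my-etl-job is a placeholder:

    import boto3

    # boto3 is available by default in the Lambda Python runtime
    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # "my-etl-job" is a placeholder; substitute your own Glue job's name
        response = glue.start_job_run(JobName="my-etl-job")
        return {"JobRunId": response["JobRunId"]}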
Upvotes: 0
Reputation: 83
Now you can put your script on AWS Lambda for ETL. Lambda supports scheduled invocation as well as triggers from other AWS components. It is on-demand, and you are charged only when the Lambda function executes.
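As a sketch of the scheduling side, the snippet below uses boto3 to create a CloudWatch Events rule that invokes a Lambda function once a day; the rule name, function name, and ARN are all hypothetical:

    import boto3

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    # Hypothetical ARN; replace with your function's actual ARN
    function_arn = "arn:aws:lambda:us-east-1:123456789012:function:my-etl-trigger"

    # Fire every day at 02:00 UTC
    rule = events.put_rule(
        Name="daily-etl-trigger",
        ScheduleExpression="cron(0 2 * * ? *)",
    )

    # Point the rule at the Lambda function
    events.put_targets(
        Rule="daily-etl-trigger",
        Targets=[{"Id": "1", "Arn": function_arn}],
    )

    # Allow CloudWatch Events to invoke the function
    lambda_client.add_permission(
        FunctionName="my-etl-trigger",
        StatementId="allow-daily-etl-rule",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )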
Upvotes: 0
Reputation: 269861
The simplest would be to launch an Amazon EC2 instance and trigger the ETL job as part of the User Data. A script passed via User Data is automatically executed when the instance is launched.
If you want to get creative, you could launch the instance using spot pricing: set a high maximum spot price (to ensure the instance runs), but you will likely pay only the lower current spot market price.
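A rough sketch of this approach with boto3, where the AMI ID, instance type, and script location are all placeholders (the User Data script also assumes an instance profile that allows reading the script from S3):

    import boto3

    ec2 = boto3.client("ec2")

    # User Data: runs automatically at launch, then shuts the instance down;
    # a one-time Spot instance terminates on an instance-initiated shutdown
    user_data = """#!/bin/bash
    aws s3 cp s3://my-bucket/etl.py /tmp/etl.py
    python3 /tmp/etl.py
    shutdown -h now
    """

    ec2.run_instances(
        ImageId="ami-0abcdef1234567890",   # placeholder AMI ID
        InstanceType="t3.medium",          # placeholder instance type
        MinCount=1,
        MaxCount=1,
        UserData=user_data,                # boto3 base64-encodes this for you
        # Request a one-time Spot instance; without MaxPrice the cap defaults
        # to the On-Demand price, matching the "high spot price" advice above
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    )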
Upvotes: 1
Reputation: 128
(I am an employee of Qubole.) If you are going to use Hadoop to run your Python scripts, Qubole manages the cluster for you: it starts a cluster when a job is submitted and shuts it down once the cluster has been idle for a while. More details are available in the FAQ: http://docs.qubole.com/en/latest/faqs/hadoop-clusters/clusters-brought-shutdown.html
Upvotes: 0