Reputation: 117
How can I run my PySpark code on AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
Upvotes: 0
Views: 2985
Reputation: 5526
You need a transient cluster for this use case; it will auto-terminate once your job completes or the timeout is reached, whichever occurs first.
You can access this link to see how to initialise one.
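For reference, a minimal boto3 sketch of such a transient cluster that runs a PySpark script from S3 and shuts itself down when the step finishes (bucket paths, instance types and the EMR release label are placeholders, not taken from the question):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-pyspark-job",
    ReleaseLabel="emr-6.9.0",                    # placeholder release label
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",           # placeholder log location
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # False = terminate the cluster as soon as all steps have completed
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run-pyspark-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/scripts/job.py",   # placeholder script path
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR roles must already exist in the account
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])
```

Setting KeepJobFlowAliveWhenNoSteps to False is what makes the cluster transient: it terminates on its own once the submitted steps finish.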
Upvotes: 1
Reputation: 1410
There are several ways to create an EMR cluster:
- Using boto3 / AWS CLI / Java SDK
- Using CloudFormation
- Using Data Pipeline
Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
No, it isn't mandatory to use Lambda to create an auto-terminating cluster.
You just need to specify the flag
--auto-terminate
while creating the cluster using boto3 / CLI / Java SDK. But in this case you need to submit the job along with the cluster config (Ref).
Note: it's not possible to create an auto-terminating cluster using CloudFormation. By design, CloudFormation assumes that the resources it creates will be more or less permanent.
If you really had to do it this way, you could make an AWS API call to delete the CloudFormation stack once your EMR tasks finish.
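A rough sketch of that clean-up call with boto3 (the stack name is hypothetical):

```python
import boto3

cfn = boto3.client("cloudformation")

# Delete the stack that provisioned the EMR cluster once the EMR steps are done.
# "transient-emr-stack" is a hypothetical stack name.
cfn.delete_stack(StackName="transient-emr-stack")

# Optionally block until the deletion has completed.
cfn.get_waiter("stack_delete_complete").wait(StackName="transient-emr-stack")
```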
How can I run my PySpark code on AWS EMR from AWS Lambda?
You can design your Lambda to submit the Spark job. You can find an example here.
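If the cluster is already running, one possible shape for such a Lambda is to add a step with add_job_flow_steps; the cluster ID and script path below are hypothetical and would normally come in through the event payload or environment variables:

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Hypothetical defaults; in practice these would come from the event.
    cluster_id = event.get("cluster_id", "j-XXXXXXXXXXXXX")
    script_path = event.get("script_path", "s3://my-bucket/scripts/job.py")

    # Submit a spark-submit step to the existing cluster.
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[
            {
                "Name": "pyspark-step-from-lambda",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_path],
                },
            }
        ],
    )
    return {"step_ids": response["StepIds"]}
```

For a cluster created on the fly, the handler can instead call run_job_flow with the step included, as in the transient-cluster sketch shown earlier in this thread.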
In my use case I have one parameterised Lambda which invokes CloudFormation to create the cluster, submits the job and terminates the cluster.
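Purely as an illustration of that pattern, the "create cluster" part of such a Lambda might call CloudFormation like this, assuming a template stored in S3 that defines an EMR cluster (stack name, template URL and parameters are all hypothetical):

```python
import boto3

cfn = boto3.client("cloudformation")

# Hypothetical stack name, template location and parameters.
cfn.create_stack(
    StackName="transient-emr-stack",
    TemplateURL="https://s3.amazonaws.com/my-bucket/templates/emr-cluster.yaml",
    Parameters=[
        {"ParameterKey": "ScriptPath", "ParameterValue": "s3://my-bucket/scripts/job.py"},
    ],
    Capabilities=["CAPABILITY_IAM"],   # needed if the template creates IAM roles
)
```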
Upvotes: 0