Sample_friend

Reputation: 117

How to run PySpark on AWS EMR with AWS Lambda

How can I make my PySpark code run on AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

Upvotes: 0

Views: 2985

Answers (2)

Shubham Jain

Reputation: 5526

You need a transient cluster for this use case; it will auto-terminate once your job completes or the timeout is reached, whichever occurs first.

You can access this link to see how to initialise one.

Upvotes: 1

SnigJi

Reputation: 1410

The options available to create an EMR cluster are:

  1. Using boto3 / AWS CLI / Java SDK
  2. Using CloudFormation
  3. Using Data Pipeline

Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

No. It isn't mandatory to use Lambda to create an auto-terminating cluster.

You just need to specify the --auto-terminate flag while creating the cluster using boto3 / AWS CLI / Java SDK. But in this case you need to submit the job along with the cluster config. Ref
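With boto3 the CLI's --auto-terminate corresponds to setting KeepJobFlowAliveWhenNoSteps to False in run_job_flow, with the steps submitted as part of the same request. A minimal sketch follows; the cluster name, EMR release label, instance types, role names, and S3 paths are placeholders you would replace with your own:

```python
def build_transient_cluster_request(script_s3_path, log_uri):
    """Build a run_job_flow request for a transient (auto-terminating)
    EMR cluster that runs one PySpark script from S3.

    KeepJobFlowAliveWhenNoSteps=False makes the cluster shut down
    as soon as the last step finishes (the boto3 equivalent of the
    CLI's --auto-terminate flag).
    """
    return {
        "Name": "transient-pyspark-cluster",   # placeholder name
        "ReleaseLabel": "emr-6.3.0",           # example release label
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # False => terminate once all submitted steps complete
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-pyspark-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # default EMR roles
        "ServiceRole": "EMR_DefaultRole",
    }

# Actual submission (requires AWS credentials):
# import boto3
# emr = boto3.client("emr")
# emr.run_job_flow(**build_transient_cluster_request(
#     "s3://my-bucket/job.py", "s3://my-bucket/logs/"))
```

Because the job is part of the cluster config, the cluster spins up, runs the step, and terminates itself with no further calls needed.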

Note:

It's not possible to create an auto-terminating cluster using CloudFormation. By design, CloudFormation assumes that the resources being created will be permanent to some extent.

If you really had to do it this way, you could make an AWS API call to delete the CloudFormation stack once your EMR tasks finish.
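One way to sketch that workaround: poll the cluster's step states with boto3, and once every step is terminal, delete the stack. The cluster ID and stack name below are hypothetical placeholders:

```python
def should_delete_stack(step_states):
    """True once every EMR step has reached a terminal state,
    i.e. the cluster has no more work to do."""
    terminal = {"COMPLETED", "FAILED", "CANCELLED"}
    return bool(step_states) and all(s in terminal for s in step_states)

# Usage (requires AWS credentials; IDs/names are placeholders):
# import boto3
# emr = boto3.client("emr")
# cf = boto3.client("cloudformation")
# states = [s["Status"]["State"]
#           for s in emr.list_steps(ClusterId="j-XXXXXXXXXXXXX")["Steps"]]
# if should_delete_stack(states):
#     cf.delete_stack(StackName="my-emr-stack")  # hypothetical stack name
```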

How can I make my PySpark code run on AWS EMR from AWS Lambda?

You can design your Lambda to submit a Spark job. You can find an example here

In my use case, I have one parameterised Lambda which invokes CloudFormation to create the cluster, submits the job, and terminates the cluster.
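A Lambda that submits a Spark job to an existing cluster can be sketched with boto3's add_job_flow_steps; the event keys (cluster_id, script_s3_path) are assumptions about how you would parameterise the invocation:

```python
def build_spark_step(script_s3_path, step_name="pyspark-job"):
    """Build an EMR step that runs a PySpark script stored on S3
    via command-runner.jar / spark-submit."""
    return {
        "Name": step_name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path],
        },
    }

def lambda_handler(event, context):
    """Lambda entry point: submit a step to a running EMR cluster.

    Expects event = {"cluster_id": ..., "script_s3_path": ...}
    (hypothetical event shape, not part of the original answer).
    """
    step = build_spark_step(event["script_s3_path"])
    # Actual submission (requires AWS credentials):
    # import boto3
    # emr = boto3.client("emr")
    # emr.add_job_flow_steps(JobFlowId=event["cluster_id"], Steps=[step])
    return {"submitted_step": step["Name"]}
```

The same Lambda can first create the cluster (via CloudFormation or run_job_flow) and pass the resulting cluster ID into this step-submission call.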

Upvotes: 0
