Reputation: 247
I would like to trigger an EMR Spark job from Python code through AWS Lambda after an S3 event fires. I would appreciate it if anyone could share the configuration/commands to invoke an EMR Spark job from an AWS Lambda function.
Upvotes: 1
Views: 1592
Reputation: 1352
Since this question is very generic, I will try to give example code for doing this. You will have to change certain parameters based upon your actual values.
The way I generally do this is to place the main handler function in one file, say lambda_handler.py, and all the EMR configuration and steps in a file named emr_configuration_and_steps.py.
Please check the code snippet below for lambda_handler.py:
import boto3
import emr_configuration_and_steps
import logging
import traceback

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(levelname)s:%(name)s:%(message)s')


def create_emr(name):
    """Spawn an EMR cluster using the settings defined in emr_configuration_and_steps.py."""
    try:
        emr = boto3.client('emr')
        # run_job_flow creates the cluster and submits the configured steps in one call.
        cluster_id = emr.run_job_flow(
            Name=name,
            VisibleToAllUsers=emr_configuration_and_steps.visible_to_all_users,
            LogUri=emr_configuration_and_steps.log_uri,
            ReleaseLabel=emr_configuration_and_steps.release_label,
            Applications=emr_configuration_and_steps.applications,
            Tags=emr_configuration_and_steps.tags,
            Instances=emr_configuration_and_steps.instances,
            Steps=emr_configuration_and_steps.steps,
            Configurations=emr_configuration_and_steps.configurations,
            ScaleDownBehavior=emr_configuration_and_steps.scale_down_behavior,
            ServiceRole=emr_configuration_and_steps.service_role,
            JobFlowRole=emr_configuration_and_steps.job_flow_role
        )
        logger.info("EMR is created successfully")
        return cluster_id['JobFlowId']
    except Exception:
        traceback.print_exc()
        raise


def lambda_handler(event, context):
    logger.info("starting the lambda function for spawning EMR")
    try:
        emr_cluster_id = create_emr('Name of Your EMR')
        logger.info("emr_cluster_id is = " + emr_cluster_id)
    except Exception as e:
        logger.error("Exception at some step in the process " + str(e))
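Since your trigger is an S3 event, the bucket and object key that fired the event are available on the event object the handler receives. If you need them (for example to pass the triggering file to the Spark job), you could replace the lambda_handler above with a sketch like this; the Records structure is the standard S3 event notification payload:

def lambda_handler(event, context):
    logger.info("starting the lambda function for spawning EMR")
    # An S3 notification delivers one or more records; take the first one here.
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    logger.info("triggered by s3://" + bucket + "/" + key)
    try:
        emr_cluster_id = create_emr('Name of Your EMR')
        logger.info("emr_cluster_id is = " + emr_cluster_id)
    except Exception as e:
        logger.error("Exception at some step in the process " + str(e))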
Now the second file (emr_configuration_and_steps.py), which has all the configuration, would look like this:
visible_to_all_users = True
log_uri = 's3://your-s3-log-path-here/'
release_label = 'emr-5.29.0'
applications = [{'Name': 'Spark'}, {'Name': 'Hadoop'}]
tags = [
    {'Key': 'Project', 'Value': 'Your-Project Name'},
    {'Key': 'Service', 'Value': 'Your-Service Name'},
    {'Key': 'Environment', 'Value': 'Development'}
]
instances = {
    'Ec2KeyName': 'Your-key-name',
    'Ec2SubnetId': 'your-subnet-name',
    'InstanceFleets': [
        {
            'InstanceFleetType': 'MASTER',
            'TargetOnDemandCapacity': 1,
            'TargetSpotCapacity': 0,
            'InstanceTypeConfigs': [
                {
                    'WeightedCapacity': 1,
                    'BidPriceAsPercentageOfOnDemandPrice': 100,
                    'InstanceType': 'm3.xlarge'
                }
            ],
            'Name': 'Master Node'
        },
        {
            'InstanceFleetType': 'CORE',
            'TargetSpotCapacity': 8,
            'InstanceTypeConfigs': [
                {
                    'WeightedCapacity': 8,
                    'BidPriceAsPercentageOfOnDemandPrice': 50,
                    'InstanceType': 'm3.xlarge'
                }
            ],
            'Name': 'Core Node'
        },
    ],
    'KeepJobFlowAliveWhenNoSteps': False
}
steps = [
    {
        'Name': 'Setup Hadoop Debugging',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['state-pusher-script']
        }
    },
    {
        'Name': 'Active Marker for digital panel',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--driver-memory', '4g',
                '--executor-memory', '4g',
                '--executor-cores', '2',
                '--class', 'your-main-class-full-path-name',
                's3://your-jar-path-SNAPSHOT-jar-with-dependencies.jar'
            ]
        }
    }
]
configurations = [
    {
        'Classification': 'spark-log4j',
        'Properties': {
            'log4j.logger.root': 'INFO',
            'log4j.logger.org': 'INFO',
            'log4j.logger.com': 'INFO'
        }
    }
]
scale_down_behavior = 'TERMINATE_AT_TASK_COMPLETION'
service_role = 'EMR_DefaultRole'
job_flow_role = 'EMR_EC2_DefaultRole'
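One note, since you mentioned triggering the Spark job with Python code: the spark-submit step above runs a JAR, but if your job is a PySpark script you can point spark-submit at the script on S3 instead. A sketch of such a step (the S3 path is a placeholder to replace with your own):

steps = [
    {
        'Name': 'Run PySpark job',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                # Placeholder path: your PySpark script uploaded to S3.
                's3://your-bucket/path/to/your_job.py'
            ]
        }
    }
]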
Please adjust the paths and names according to your use case. To deploy this you need to install boto3 and package/zip these two files into a zip file, then upload that to your Lambda function. With this you should be able to spawn the EMR cluster.
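If you want to confirm that the cluster actually came up, boto3 exposes describe_cluster and waiters for this. A minimal sketch, reusing the JobFlowId returned by create_emr (the cluster id below is a placeholder):

import boto3

emr = boto3.client('emr')
cluster_id = 'j-XXXXXXXXXXXXX'  # placeholder: the JobFlowId returned by create_emr

# One-off status check.
state = emr.describe_cluster(ClusterId=cluster_id)['Cluster']['Status']['State']
print(state)  # e.g. STARTING, RUNNING, TERMINATED

# Or block until the cluster is up. Be careful doing this inside Lambda:
# the wait counts against the function timeout, so it suits a local test better.
emr.get_waiter('cluster_running').wait(ClusterId=cluster_id)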
Upvotes: 2