Reputation: 313
I'm trying to spin up an EMR cluster with a Spark step using a Lambda function.
Here is my lambda function (python 2.7):
import boto3
def lambda_handler(event, context):
conn = boto3.client("emr")
cluster_id = conn.run_job_flow(
Name='LSR Batch Testrun',
ServiceRole='EMR_DefaultRole',
JobFlowRole='EMR_EC2_DefaultRole',
VisibleToAllUsers=True,
LogUri='s3n://aws-logs-171256445476-ap-southeast-2/elasticmapreduce/',
ReleaseLabel='emr-5.16.0',
Instances={
"Ec2SubnetId": "<my-subnet>",
'InstanceGroups': [
{
'Name': 'Master nodes',
'Market': 'ON_DEMAND',
'InstanceRole': 'MASTER',
'InstanceType': 'm3.xlarge',
'InstanceCount': 1,
},
{
'Name': 'Slave nodes',
'Market': 'ON_DEMAND',
'InstanceRole': 'CORE',
'InstanceType': 'm3.xlarge',
'InstanceCount': 2,
}
],
'KeepJobFlowAliveWhenNoSteps': False,
'TerminationProtected': False
},
Applications=[{
'Name': 'Spark',
'Name': 'Hive'
}],
Configurations=[
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
},
{
"Classification": "spark-hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
],
Steps=[{
'Name': 'mystep',
'ActionOnFailure': 'TERMINATE_CLUSTER',
'HadoopJarStep': {
'Jar': 's3://elasticmapreduce/libs/script-runner/script-runner.jar',
'Args': [
"/home/hadoop/spark/bin/spark-submit", "--deploy-mode", "cluster",
"--master", "yarn-cluster", "--class", "org.apache.spark.examples.SparkPi",
"s3://support.elasticmapreduce/spark/1.2.0/spark-examples-1.2.0-hadoop2.4.0.jar", "10"
]
}
}],
)
return "Started cluster {}".format(cluster_id)
The cluster is starting up, but when trying to execute the step it fails. The error log is containing the following exception:
Exception in thread "main" java.lang.RuntimeException: Local file does not exist.
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.fetchFile(ScriptRunner.java:30)
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.main(ScriptRunner.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
So it seems like the script-runner is not understanding to pick up the .jar file from S3?
Any help appreciated...
Upvotes: 3
Views: 5856
Reputation: 313
I could solve the problem eventually. Main problem was the broken "Applications" configuration, which has to look like the following instead:
Applications=[{
'Name': 'Spark'
},
{
'Name': 'Hive'
}],
The final Steps element:
Steps=[{
'Name': 'lsr-step1',
'ActionOnFailure': 'TERMINATE_CLUSTER',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
"spark-submit", "--class", "org.apache.spark.examples.SparkPi",
"s3://support.elasticmapreduce/spark/1.2.0/spark-examples-1.2.0-hadoop2.4.0.jar", "10"
]
}
}]
Upvotes: 3
Reputation: 813
Not all EMR pre-built with ability to copy your jar, script from S3 so you must do that in bootstrap steps:
BootstrapActions=[
{
'Name': 'Install additional components',
'ScriptBootstrapAction': {
'Path': code_dir + '/scripts' + '/emr_bootstrap.sh'
}
}
],
And here is what my bootstrap does
#!/bin/bash
HADOOP="/home/hadoop"
BUCKET="s3://<yourbucket>/<path>"
# Sync jars libraries
aws s3 sync ${BUCKET}/jars/ ${HADOOP}/
aws s3 sync ${BUCKET}/scripts/ ${HADOOP}/
# Install python packages
sudo pip install --upgrade pip
sudo ln -s /usr/local/bin/pip /usr/bin/pip
sudo pip install psycopg2 numpy boto3 pythonds
Then you can call your script and jar like this
{
'Name': 'START YOUR STEP',
'ActionOnFailure': 'TERMINATE_CLUSTER',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
"spark-submit", "--jars", ADDITIONAL_JARS,
"--py-files", "/home/hadoop/modules.zip",
"/home/hadoop/<your code>.py"
]
}
},
Upvotes: 1