Reputation: 29
I am trying to run a step on an EMR cluster (1 master and 2 core nodes) with a very simple Python script that I uploaded to S3 for use as an EMR Spark application step. The script reads a data.txt file from S3 and writes it back; it can be seen below:
from pyspark import SparkContext  # SparkContext must be imported explicitly
import boto3  # importing boto3 is what triggers the failure

sc = SparkContext()
# read data.txt from S3 and write it back as a single partition
text_file = sc.textFile('s3://First_bucket/data.txt')
text_file.repartition(1).saveAsTextFile('s3://First_bucket/logdata')
sc.stop()
However, this script runs without error when the import boto3 line is removed. To fix the problem, I tried adding a bootstrap action with a boto.sh file while creating my EMR cluster. The boto.sh file I used is as follows:
#!/bin/bash
# install pip for Python 3.6, then install boto3 into Spark's Python path
sudo easy_install-3.6 pip
sudo pip install --target /usr/lib/spark/python/ boto3
Unfortunately, this only made the boto3 library available on the master node, not the core nodes. The EMR step still failed, and the error log file is:
2020-02-08T20:56:49.698Z INFO Ensure step 4 jar file command-runner.jar
2020-02-08T20:56:49.699Z INFO StepRunner: Created Runner for step 4
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster s3://First_bucket/data.py'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=eu-central-1
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2V51S7I25TLLW/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
PYTHON_INSTALL_LAYOUT=amzn
HOSTNAME=ip-***-***-***-***
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2V51S7I25TLLW
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
EMR_STEP_ID=s-2V51S7I25TLLW
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2V51S7I25TLLW
INFO ProcessRunner started child process 22893
2020-02-08T20:56:49.705Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2V51S7I25TLLW Child Pid: 22893
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 26 seconds
2020-02-08T20:57:15.787Z INFO Step created jobs:
2020-02-08T20:57:15.787Z WARN Step failed with exitCode 1 and took 26 seconds
My question is: how can I run an EMR Spark application step with a Python script that uses libraries such as boto3? Thanks in advance.
Upvotes: 1
Views: 1957
Reputation: 862
The answer is bootstrap actions.
By adding a bootstrap action [1] while creating the cluster, you can install the boto3 package on every node. For a cluster that is already running, you would instead need to install boto3 on all nodes manually, either by connecting to each node or by using a configuration tool such as Chef or Ansible.
The bootstrap action script will contain something like:
sudo pip-3.6 install boto3
Or
sudo pip install boto3
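For illustration, a minimal sketch of wiring such a bootstrap action into cluster creation with the AWS CLI might look like the following (the cluster name, bucket path, script name, instance settings, and key name are placeholders, not values from the question):

# hypothetical names; substitute your own bucket, script, and key
aws emr create-cluster \
  --name "spark-with-boto3" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --bootstrap-actions Path=s3://my-bucket/boto.sh,Name=InstallBoto3

Because the script runs on every node before Spark starts, boto3 ends up available to executors on the core nodes as well, not just the master.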
Note: Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.
The logs from running the bootstrap action will be located in '/mnt/var/log/bootstrap-actions' on all nodes.
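If the package still seems to be missing, a quick check is to SSH into a node and read those logs (the key file and host below are placeholders, not values from the question):

# hypothetical key file and node address; substitute your cluster's values
ssh -i my-key.pem hadoop@ec2-xx-xx-xx-xx.eu-central-1.compute.amazonaws.com
# each bootstrap action writes to its own numbered subdirectory
cat /mnt/var/log/bootstrap-actions/1/stdout
cat /mnt/var/log/bootstrap-actions/1/stderr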
[1] Create Bootstrap Actions to Install Additional Software - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
Upvotes: 3