Reputation: 1355
I am working with an AWS Data Pipeline that has a ShellCommandActivity whose script URI points to a bash file located in an S3 bucket. The bash file copies a Python script from the same S3 bucket to an EmrCluster and then tries to execute that Python script.
This is my pipeline export:
{
  "objects": [
    {
      "name": "DefaultResource1",
      "id": "ResourceId_27dLM",
      "amiVersion": "3.9.0",
      "type": "EmrCluster",
      "region": "us-east-1"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://project/bin/scripts/logs/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "stage": "true",
      "scriptUri": "s3://project/bin/scripts/RunPython.sh",
      "name": "DefaultShellCommandActivity1",
      "id": "ShellCommandActivityId_hA57k",
      "runsOn": {
        "ref": "ResourceId_27dLM"
      },
      "type": "ShellCommandActivity"
    }
  ],
  "parameters": []
}
This is RunPython.sh:
#!/usr/bin/env bash
aws s3 cp s3://project/bin/scripts/Test.py ./
python ./Test.py
This is Test.py
__author__ = 'MrRobot'
import re
import os
import sys
import boto3
print "We've entered the python file"
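Note that the bare print statement ties Test.py to Python 2, which matches the AMI-3.9.0 era; on a cluster where python resolves to Python 3, the function form is required. An illustrative variant (not part of the original script) that runs under either interpreter:

```python
# Works on both Python 2.7 and Python 3 thanks to the __future__ import.
from __future__ import print_function

message = "We've entered the python file"
print(message)
```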
From the Stdout Log I get:
download: s3://project/bin/scripts/Test.py to ./
From the Stderr Log I get:
python: can't open file 'Test.py': [Errno 2] No such file or directory
I have also tried replacing python ./Test.py with python Test.py, but I get the same result.
How do I get my AWS Data Pipeline to execute my Test.py script?
EDIT
When I set scriptUri to s3://project/bin/scripts/Test.py I get the following errors :
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 1: author: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 2: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 3: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 4: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 5: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 7: print: command not found
EDIT 2
Added the following line to Test.py
#!/usr/bin/env python
Then I received the following error:
error: line 6, in import boto3
ImportError: No module named boto3
Using @franklinsijo's advice, I created a Bootstrap Action on the EmrCluster with the following value:
s3://project/bin/scripts/BootstrapActions.sh
This is BootstrapActions.sh
#!/usr/bin/env bash
sudo pip install boto3
This worked!
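If a bootstrap install is in doubt, one way to confirm it succeeded is to probe for the module before using it. A small sketch (illustrative, not part of the original pipeline):

```python
import importlib.util

def module_available(name):
    """Return True if the named module can be found on the import path."""
    return importlib.util.find_spec(name) is not None

print(module_available("os"))                  # stdlib, always present
print(module_available("definitely_missing"))  # not installed
```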
Upvotes: 5
Views: 5529
Reputation: 327
This is a helpful thread for a simple problem that was surprisingly difficult to debug. I ended up setting the Resource's Run As User field to root. I hate running as root (I tried ec2-user to no avail), but it was the only thing that gave my Python script the permissions on site-packages. Apparently the TaskRunner service doesn't have sudo access, so running sudo commands inside the .sh just fails silently.
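The underlying permissions question can be checked directly. A sketch (illustrative, not from the thread) that reports whether the current user can write to each global site-packages directory, which is what decides between running as root and installing with pip install --user:

```python
import os
import site

def writable_site_dirs():
    """Map each global site-packages directory to whether the
    current user has write permission there."""
    return {d: os.access(d, os.W_OK) for d in site.getsitepackages()}

for path, ok in writable_site_dirs().items():
    print(path, "writable" if ok else "needs root or pip install --user")
```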
Upvotes: 0
Reputation: 18270
Configure the ShellCommandActivity with Script Uri set to the path of the Python script, and add
#!/usr/bin/env python
as the first line of the script.
If runsOn is chosen, add the installation commands as a bootstrap action for the EMR resource.
If workerGroup is chosen, install all the libraries on the worker group before pipeline activation.
Use either pip
or easy_install
to install the python modules.
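A minimal sketch of a script that Script Uri could point at directly, per the steps above (the message is illustrative):

```python
#!/usr/bin/env python
# The shebang on the first line lets the shell execute this file directly
# when ShellCommandActivity's Script Uri points straight at the .py file.
import sys

message = "executed via shebang under Python %d" % sys.version_info[0]
print(message)
```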
Upvotes: 9