user908759

Reputation: 1355

How to make an AWS Data Pipeline ShellCommandActivity script execute a Python file

I am working with an AWS Data Pipeline that has a ShellCommandActivity whose script URI is set to a bash file located in an S3 bucket. The bash file copies a Python script from the same S3 bucket onto an EmrCluster and then tries to execute that Python script.


This is my pipeline export:

{
  "objects": [
    {
      "name": "DefaultResource1",
      "id": "ResourceId_27dLM",
      "amiVersion": "3.9.0",
      "type": "EmrCluster",
      "region": "us-east-1"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://project/bin/scripts/logs/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "stage": "true",
      "scriptUri": "s3://project/bin/scripts/RunPython.sh",
      "name": "DefaultShellCommandActivity1",
      "id": "ShellCommandActivityId_hA57k",
      "runsOn": {
        "ref": "ResourceId_27dLM"
      },
      "type": "ShellCommandActivity"
    }
  ],
  "parameters": []
}

This is RunPython.sh:

#!/usr/bin/env bash
aws s3 cp s3://project/bin/scripts/Test.py ./
python ./Test.py

This is Test.py:

__author__ = 'MrRobot'
import re
import os
import sys
import boto3

print "We've entered the python file"

From the stdout log I get:

download: s3://project/bin/scripts/Test.py to ./

From the stderr log I get:

python: can't open file 'Test.py': [Errno 2] No such file or directory

I have also tried replacing python ./Test.py with python Test.py, but I get the same result.

How do I get my AWS Data Pipeline to execute my Test.py script?
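For reference, a diagnostic variant of RunPython.sh along these lines would show where the object actually lands and remove any dependence on the working directory (this is only a sketch; the /tmp path is illustrative):

#!/usr/bin/env bash
# Sketch: print the working directory and its contents before running anything,
# then copy to an explicit path so execution does not depend on where
# the shell happens to be running.
set -ex
pwd
ls -l
aws s3 cp s3://project/bin/scripts/Test.py /tmp/Test.py
python /tmp/Test.py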

EDIT

When I set scriptUri to s3://project/bin/scripts/Test.py, I get the following errors:

/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 1: author: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 2: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 3: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 4: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 5: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 7: print: command not found

EDIT 2

Added the following line to Test.py:

#!/usr/bin/env python

Then I received the following error:

error: line 6, in import boto3 ImportError: No module named boto3

Using @franklinsijo's advice, I created a Bootstrap Action on the EmrCluster with the following value:

s3://project/bin/scripts/BootstrapActions.sh

This is BootstrapActions.sh:

#!/usr/bin/env bash
sudo pip install boto3

This worked!

Upvotes: 5

Views: 5529

Answers (2)

Aaron Soellinger

Reputation: 327

This is a helpful thread for a simple problem that was surprisingly difficult to debug. I ended up setting the Resource's Run As User field to root. I hate running as root (I tried ec2-user to no avail), but it was the only thing that gave my Python script the permissions it needed on site-packages. Apparently the TaskRunner service doesn't have sudo access, so running sudo commands inside the .sh just fails silently.

Upvotes: 0

franklinsijo

Reputation: 18270

Configure the ShellCommandActivity with:

  • Pass the S3 URI path of the Python file as the Script Uri.
  • Add the shebang line #!/usr/bin/env python to the script.
  • If any non-default Python libraries are used in the script, install them on the target resource.
    • If runsOn is chosen, add the installation commands as a bootstrap action for the EMR resource.
    • If workerGroup is chosen, install all the libraries on the worker group before pipeline activation.

Use either pip or easy_install to install the Python modules.
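For example, the relevant objects of the definition could end up looking roughly like this (a sketch based on the export in the question; the IDs and S3 paths are placeholders), with bootstrapAction pointing at the install script and scriptUri pointing directly at the Python file:

{
  "objects": [
    {
      "name": "DefaultResource1",
      "id": "ResourceId_27dLM",
      "amiVersion": "3.9.0",
      "type": "EmrCluster",
      "region": "us-east-1",
      "bootstrapAction": "s3://project/bin/scripts/BootstrapActions.sh"
    },
    {
      "stage": "true",
      "scriptUri": "s3://project/bin/scripts/Test.py",
      "name": "DefaultShellCommandActivity1",
      "id": "ShellCommandActivityId_hA57k",
      "runsOn": {
        "ref": "ResourceId_27dLM"
      },
      "type": "ShellCommandActivity"
    }
  ],
  "parameters": []
}

bootstrapAction is the standard EmrCluster field for a startup script; verify the field names against the Data Pipeline documentation for the AMI version in use.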

Upvotes: 9
