nirkov

Reputation: 809

How to run a Python project (package) on AWS EMR Serverless?

I have a Python project with several modules, classes, and a dependencies file (requirements.txt). I want to package it into a single artifact with all of its dependencies and give that file's path to AWS EMR Serverless, which will run it.

The problem is that I don't understand how to package a Python project with all of its dependencies, which kind of file EMR Serverless can consume, and so on. All the examples I have found use a single Python file.

In simple words, what should I do if my Python project is not a single file but is more complex?

Upvotes: 4

Views: 7435

Answers (1)

dacort

Reputation: 863

There are a few ways to do this with EMR Serverless. Regardless of which one you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.

Let's assume you've got a job structure like the following, where main.py is your entrypoint that creates a Spark session and runs your jobs, and job1 and job2 are your local modules.

├── jobs
│   ├── job1.py
│   └── job2.py
├── main.py
└── requirements.txt
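
For reference, a minimal main.py for this layout might look something like the sketch below. The run_job functions are placeholders for whatever job1.py and job2.py actually expose; adjust the imports and calls to your own code.

from pyspark.sql import SparkSession

# Local modules: importable either via --py-files (Option 1 below) or because
# they are installed into the packaged virtual environment (Option 2 below)
from jobs import job1, job2

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my-emr-serverless-job").getOrCreate()

    # Placeholder entry points -- call whatever functions your modules define
    job1.run_job(spark)
    job2.run_job(spark)

    spark.stop()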

Option 1. Use --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies

  • Zip up your job files
zip -r job_files.zip jobs
  • Create a virtual environment using venv-pack with your dependencies.

Note: This has to be done with a similar OS and Python version to EMR Serverless, so I prefer using a multi-stage Dockerfile with custom outputs.
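
For context, the Dockerfile below just automates the following manual steps, which you could also run directly if you already had a matching Amazon Linux 2 / Python environment (file and directory names here are only examples):

python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack
pip install -r requirements.txt
venv-pack -o pyspark_deps.tar.gz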

FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

COPY requirements.txt .

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install -r requirements.txt

RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

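# Export-only stage: with `docker build --output`, the packed venv ends up on the host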
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /

If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system.
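
As an optional sanity check before uploading, you can list the contents of the archive; you should see bin/python and your installed packages at the top level:

tar -tzf pyspark_deps.tar.gz | head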

  • Upload main.py, job_files.zip, and pyspark_deps.tar.gz to a location on S3.

  • Run your EMR Serverless job with a command like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/main.py",
            "sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'
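
start-job-run returns a jobRunId in its output; if you want to check on the job from the command line, you can poll it with get-job-run (the job run ID below is a placeholder taken from that output):

aws emr-serverless get-job-run \
    --application-id $APPLICATION_ID \
    --job-run-id <JOB_RUN_ID>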

Option 2. Package your local modules as a Python library and use --archives with a packaged virtual environment

This is probably the most reliable way, but it will require you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt:

[build-system]
requires = ["setuptools>=62.6"]  # dynamic dependencies from a file need a recent setuptools
build-backend = "setuptools.build_meta"

[project]
name = "mysparkjobs"
version = "0.0.1"
dynamic = ["dependencies"]

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

You can then use a multi-stage Dockerfile and custom build outputs to package your modules and dependencies into a virtual environment.

Note: This requires Docker BuildKit to be enabled.

FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /app
COPY . .
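# `pip install .` installs your local modules plus the dependencies declared in
# requirements.txt (pulled in via pyproject.toml) into the virtual environment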
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install .

RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /

Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with all your dependencies. Upload this file and your main.py script to S3.
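
If you use the AWS CLI for the upload, it might look like this (the bucket and prefix are placeholders that match the path used below):

aws s3 cp main.py s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py
aws s3 cp pyspark_deps.tar.gz s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz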

Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_ROLE_ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'

Note that the sparkSubmitParameters here only need spark.archives plus the driver and executor environment variables pointing at the packaged Python; because your local modules are installed into the virtual environment, no --py-files argument is required.

Upvotes: 13
