Akshay Apte

Reputation: 1653

Including another file in Dataflow Python flex template, ImportError

Is there an example of a Python Dataflow Flex Template with more than one file, where the script imports other files included in the same folder?

My project structure is like this:

├── pipeline
│   ├── __init__.py
│   ├── main.py
│   ├── setup.py
│   ├── custom.py

I'm trying to import custom.py inside main.py for a Dataflow Flex Template.
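For reference, main.py does nothing more exotic than this (simplified):

# main.py (simplified)
import custom  # raises ModuleNotFoundError at template launch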

I receive the following error in the pipeline execution:

ModuleNotFoundError: No module named 'custom'

The pipeline works fine if I include all of the code in a single file and don't make any imports.

Example Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template/pipeline
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY pipeline /dataflow/template/pipeline

COPY spec/python_command_spec.json /dataflow/template/

ENV DATAFLOW_PYTHON_COMMAND_SPEC /dataflow/template/python_command_spec.json

RUN pip install avro-python3 pyarrow==0.11.1 apache-beam[gcp]==2.24.0

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

Python spec file:

{
    "pyFile":"/dataflow/template/pipeline/main.py"
}
  

I am deploying the template with the following command:

gcloud builds submit --project=${PROJECT} --tag ${TARGET_GCR_IMAGE} .

Upvotes: 13

Views: 3759

Answers (6)

Valentyn

Reputation: 565

The crux of the problem is that the package is not installed in the launch environment, so some modules might not be importable depending on the current directory and/or the value of $PYTHONPATH. Structure the pipeline as a package and install it. For example, consider the following structure:

/template      # Location of the template files in target image. 
  ├── some_package
  │   ├── launcher.py        # Parses command line args, calls Pipeline.run().
  │   ├── some_pipeline.py   # Pipeline(s) could be defined in separate file(s).
  │   ├── some_transforms.py # Building blocks to reference in other modules.  
  │   └── utils                  # You can have subpackages too.
  │       └── some_helper_functions.py
  ├── main.py   # Entrypoint. Calls `launcher.some_run_method()`.
  └── setup.py  # Defines the package and its requirements
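A minimal setup.py for this layout might look like the following (the name and version here are placeholders):

import setuptools

setuptools.setup(
    name="some_package",   # Placeholder; use your own package name.
    version="0.1.0",       # Placeholder version.
    # find_packages() assumes each package directory has an __init__.py;
    # use find_namespace_packages() if you prefer to omit them.
    packages=setuptools.find_packages(),
    install_requires=[],   # Pipeline dependencies go here.
)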

The Flex Template Dockerfile might look like the following:

ARG WORKDIR=/template
WORKDIR ${WORKDIR}                
COPY setup.py .
COPY main.py .
COPY some_package some_package

# This is the key line to solve the problem discussed in this question. 
# Installing the package allows importing its modules regardless of current path.
RUN pip install -e .  

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

...
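main.py stays a thin entrypoint that defers to the installed package, something like:

# main.py - thin entrypoint; the pipeline logic lives in the installed package.
from some_package import launcher

if __name__ == "__main__":
    launcher.some_run_method()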

Upvotes: 0

aec

Reputation: 1135

I have a bunch of pipelines in the same repo, and all of them need to use my packages in the utils directory.

The solution for me was to add a symlink in each pipeline directory pointing to the utils directory. That was not required to run locally, to run with the Dataflow Runner, or to create and run a Classic Template, but it was necessary to run a Flex Template.

pipelines/
├── pipeline_1
│   ├── pipeline_1_metadata
│   ├── pipeline_1.py
│   ├── bin
│   │   ├── build_flex_template_and_image.sh
│   │   ├── run_flex_template.sh
│   │   ├── ...
│   ├── README.md
│   └── utils -> ../utils # Added this, and it worked
├── pipeline_2
│   ├── pipeline_2_metadata
│   ├── pipeline_2.py
│   ├── bin
│   │   ├── build_flex_template_and_image.sh
│   │   ├── run_flex_template.sh
│   │   ├── ...
│   ├── README.md
│   └── utils -> ../utils # Added this, and it worked
├── # etc.
├── requirements.txt
├── setup.py
└── utils
    ├── bigquery_utils.py
    ├── dprint.py
    ├── gcs_file_utils.py
    └── misc.py
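Each symlink is created from inside the pipeline directory, e.g.:

cd pipeline_1
ln -s ../utils utils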

My setup.py:

import setuptools

setuptools.setup(
    name="repo_name_here",
    version="0.2",
    install_requires=[], # Maybe upgrade Beam here?
    packages=setuptools.find_namespace_packages(exclude=["*venv*"]),
)

From the base directory, I build like so, using the Google-provided Docker image:

gcloud dataflow flex-template build "${TEMPLATE_FILE}" \
   --image-gcr-path "${REGION}-docker.pkg.dev/${PROJECT}/${ARTIFACT_REPO}/dataflow/pipeline-1:latest" \
   --sdk-language "PYTHON" \
   --flex-template-base-image "PYTHON3" \
   --py-path "." \
   --metadata-file "pipeline_1/pipeline_1_metadata" \
   --env "FLEX_TEMPLATE_PYTHON_PY_FILE=pipeline_1/pipeline_1.py" \
   --env "FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=flex_requirements.txt" \
   --env "FLEX_TEMPLATE_PYTHON_SETUP_FILE=setup.py"

That totally works. The only remaining hassle is that I can't use my default requirements.txt with the default Python image, since I can't figure out how to install the matching version of Python 3.9 in my venv and update requirements.txt accordingly. So I strip the pinned versions with cut -d "=" -f 1 requirements.txt > flex_requirements.txt and let the base image figure out the dependency versions, which is insanity. But that'll be another Stack Overflow question if I can't figure it out in the next couple of days.

Upvotes: 0

jamiet

Reputation: 12314

Here is my solution:

Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:flex_templates_base_image_release_20210120_RC00

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY requirements.txt .


# Read https://stackoverflow.com/questions/65766066/can-i-make-flex-template-jobs-take-less-than-10-minutes-before-they-start-to-pro#comment116304237_65766066
# to understand why apache-beam is not being installed from requirements.txt
RUN pip install --no-cache-dir -U apache-beam==2.26.0
RUN pip install --no-cache-dir -U -r ./requirements.txt

COPY mymodule.py setup.py ./
COPY protoc_gen protoc_gen/

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/mymodule.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

and here is my setup.py:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[],
    name="my df job modules",
)

Upvotes: 1

Idhem

Reputation: 964

In my case I didn't need to pass setup_file in the command that triggers the Flex Template; here is my Dockerfile:

FROM gcr.io/dataflow-templates-base/python38-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY . .

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Install apache-beam and other dependencies to launch the pipeline
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt

This is the command:

gcloud dataflow flex-template run "job_ft" --template-file-gcs-location "$TEMPLATE_PATH" --parameters paramA="valA" --region "europe-west1"

Upvotes: 0

rsantiago

Reputation: 2099

After some tests I found out that, for some unknown reason, Python files in the working directory (WORKDIR) cannot be referenced with an import, but it works if you create a subfolder and move the Python dependencies into it. I tested it and it worked; for example, in your use case you can have the following structure:

├── pipeline
│   ├── main.py
│   ├── setup.py
│   ├── mypackage
│   │   ├── __init__.py
│   │   ├── custom.py

And you will be able to reference it with import mypackage.custom. The Dockerfile should move custom.py into the proper directory:

RUN mkdir -p ${WORKDIR}/mypackage
RUN touch ${WORKDIR}/mypackage/__init__.py
COPY custom.py ${WORKDIR}/mypackage

And the dependency will be added to the Python installation directory:

$ docker exec -it <container> /bin/bash
# find / -name custom.py
/usr/local/lib/python3.7/site-packages/mypackage/custom.py
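With that in place, main.py references the module through the package, e.g.:

# main.py
import mypackage.custom  # importable now that custom.py lives inside mypackage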

Upvotes: 3

Akshay Apte

Reputation: 1653

I actually solved this by passing an additional setup_file parameter to the template execution. You also need to add the setup_file parameter to the template metadata:

--parameters setup_file="/dataflow/template/pipeline/setup.py"
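The corresponding metadata entry looks something like this (the label and help text are my own wording):

{
  "name": "setup_file",
  "label": "Setup file",
  "helpText": "Path to setup.py inside the container.",
  "isOptional": true
}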

Apparently the ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py" line in the Dockerfile is useless and doesn't actually pick up the setup file.

My setup file looked like this:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[
        'apache-beam[gcp]==2.24.0'
    ],
)

Upvotes: 5
