Reputation: 121
I am trying to launch a Dataflow flex template. As part of the build and deploy process, I am pre-building a custom SDK container image to reduce worker start-up time.
I have attempted this in two ways:

1. sdk_container_image is specified and a requirements.txt file is provided: a Dataflow flex template is successfully launched and a graph is built, but the workers cannot start because they lack the authority to install the private packages.
2. sdk_container_image is given (with dependencies pre-installed): the Dataflow job starts, but instead of running the pipeline in the same job, it attempts to launch a separate Dataflow job for the pipeline using the same name, which results in an error.

Here are my Dockerfiles and gcloud commands:
Flex template Dockerfile:
FROM gcr.io/dataflow-templates-base/python39-template-launcher-base
# Create working directory
ARG WORKDIR=/flex
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
RUN apt-get update && apt-get install -y libffi-dev && rm -rf /var/lib/apt/lists/*
COPY ./ ./
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/launch_pipeline.py"
# Install the pipeline dependencies
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir apache-beam[gcp]==2.41.0
RUN pip install --no-cache-dir -r requirements.txt
ENTRYPOINT [ "/opt/google/dataflow/python_template_launcher" ]
Worker Dockerfile:
# Set up image for worker.
FROM apache/beam_python3.9_sdk:2.41.0
WORKDIR /worker
COPY ./requirements.txt ./
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir -r requirements.txt
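Both images are built and pushed to a container registry before the template build step. A rough sketch of that step (the Dockerfile names below are placeholders, not my actual file names):

docker build -t "$IMAGE_LOCATION" -f Dockerfile.flex .
docker push "$IMAGE_LOCATION"

docker build -t "$WORKER_IMAGE" -f Dockerfile.worker .
docker push "$WORKER_IMAGE"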
Building template:
gcloud dataflow flex-template build $TEMPLATE_LOCATION \
--image "$IMAGE_LOCATION" \
--sdk-language "PYTHON" \
--metadata-file "metadata.json"
Launching template:
gcloud dataflow flex-template run ddjanke-local-flex \
--template-file-gcs-location=$TEMPLATE_LOCATION \
--project=$PROJECT \
--service-account-email=$EMAIL \
--parameters=[OTHER_ARGS...],sdk_container_image=$WORKER_IMAGE \
--additional-experiments=use_runner_v2
Upvotes: 1
Views: 736
Reputation: 121
I actually managed to solve this yesterday. The problem was that I was passing sdk_container_image to Dataflow through the flex template and then passing that through to the PipelineOptions within my code. After I removed sdk_container_image from the options, it launched the pipeline in the same job.
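For illustration, here is a minimal sketch of the corrected launcher script (the file name launch_pipeline.py comes from the flex template Dockerfile above; the argument parsing and the toy transforms are placeholders, not my real pipeline):

import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    # Placeholder pipeline-specific parameter.
    parser.add_argument("--input", default="example", help="Example input parameter.")
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Build PipelineOptions from the remaining arguments only. Do not add
    # sdk_container_image here: the flex template launch already passes it to
    # the Dataflow service, and forwarding it again through PipelineOptions is
    # what caused the launcher to start a separate job for the pipeline.
    options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([known_args.input])
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()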
Upvotes: 1