ddjanke

Reputation: 121

Dataflow flex template job attempts to launch second job (for pipeline) with same job_name

I am trying to launch a Dataflow flex template. As part of the build and deploy process, I am pre-building a custom SDK container image to reduce worker start-up time.

I have attempted this in these ways:

  1. When no sdk_container_image is specified and a requirements.txt file is provided, the flex template launches successfully and a job graph is built, but the workers cannot start because they lack the authorization to install my private packages.
  2. When an sdk_container_image is given (with dependencies pre-installed), the Dataflow job starts, but instead of running the pipeline within that job, it attempts to launch a second Dataflow job for the pipeline using the same job_name, which results in an error.
  3. If I pass a second job_name for the pipeline, the pipeline starts successfully as a separate job, but the original flex template job eventually fails with a polling timeout. The pipeline itself then fails at the very last step because the SDK harness disconnected: "The worker VM had to shut down one or more processes due to lack of memory."
  4. When I launch the pipeline locally using DataflowRunner, the pipeline runs successfully under one job name (see the command sketch after this list).
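For reference, the local launch in case 4 looks roughly like this (a sketch; $REGION and $BUCKET are placeholders rather than variables from my actual setup, and the real command passes more options):

python launch_pipeline.py \
    --runner=DataflowRunner \
    --project=$PROJECT \
    --region=$REGION \
    --temp_location=gs://$BUCKET/temp \
    --sdk_container_image=$WORKER_IMAGE \
    --experiments=use_runner_v2 \
    --job_name=ddjanke-local-flex \
    [OTHER_ARGS...]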

Here are my Dockerfiles and gcloud commands:

Flex template Dockerfile:

FROM gcr.io/dataflow-templates-base/python39-template-launcher-base

# Create working directory
ARG WORKDIR=/flex
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
#   https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
RUN apt-get update && apt-get install -y libffi-dev && rm -rf /var/lib/apt/lists/*

COPY ./ ./

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/launch_pipeline.py"

# Install the pipeline dependencies
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir apache-beam[gcp]==2.41.0
RUN pip install --no-cache-dir -r requirements.txt

ENTRYPOINT [ "/opt/google/dataflow/python_template_launcher" ]

Worker Dockerfile:

# Set up image for worker.
FROM apache/beam_python3.9_sdk:2.41.0

WORKDIR /worker

COPY ./requirements.txt ./

RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir -r requirements.txt
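The worker image is built and pushed to a registry that the Dataflow workers' service account can pull from; a minimal sketch using Cloud Build (the actual build setup may differ):

gcloud builds submit . \
    --tag "$WORKER_IMAGE" \
    --project "$PROJECT"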

Building template:

gcloud dataflow flex-template build $TEMPLATE_LOCATION \
    --image "$IMAGE_LOCATION" \
    --sdk-language "PYTHON" \
    --metadata-file "metadata.json"

Launching template:

gcloud dataflow flex-template run ddjanke-local-flex \
    --template-file-gcs-location=$TEMPLATE_LOCATION \
    --project=$PROJECT \
    --service-account-email=$EMAIL \
    --parameters=[OTHER_ARGS...],sdk_container_image=$WORKER_IMAGE \
    --additional-experiments=use_runner_v2

Upvotes: 1

Views: 736

Answers (1)

ddjanke

Reputation: 121

I actually managed to solve this yesterday. The problem was that I was passing sdk_container_image to Dataflow through the flex template and then passing it through again to the PipelineOptions within my code. After I removed sdk_container_image from the options, the template launched the pipeline within the same job.
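In other words, sdk_container_image is already supplied to Dataflow by the flex-template run command, so the launcher script should not forward it again through PipelineOptions. A simplified sketch of the change (my real launcher script has more options; the names here are illustrative):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # Before: sdk_container_image was forwarded into the pipeline options,
    # which caused the launched pipeline to try to start a second Dataflow job.
    # options = PipelineOptions(argv, sdk_container_image=WORKER_IMAGE)

    # After: let the flex template supply sdk_container_image; only pass
    # the remaining options to the pipeline.
    options = PipelineOptions(argv)

    with beam.Pipeline(options=options) as p:
        ...  # pipeline definition unchanged


if __name__ == '__main__':
    run()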

Upvotes: 1
