Reputation: 250
Apache Beam on the Cloud Dataflow runner takes 5 minutes or more to cold start the data pipeline. Is there any way to minimize the startup time?
I tried optimizing the Dockerfile, but it is still slow.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:20240812-rc01
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/dataflow_pipelines/data_pipelinie_main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="/template/setup.py"
ENV PYTHONPATH="/template"
WORKDIR /template
COPY requirements.txt /template/
COPY setup.py /template/
COPY .env /template/
COPY dataflow_pipelines/ /template/dataflow_pipelines/
RUN apt-get update && \
    apt-get install -y --no-install-recommends libffi-dev git && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir apache-beam[gcp]==2.61.0 && \
    pip install --no-cache-dir -r "$FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE" && \
    pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r "$FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE" && \
    python -m pip install -e . && \
    ls -R /template
ENV PIP_NO_DEPS=True
ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]
Upvotes: 0
Views: 43
Reputation: 180
I'm not sure whether this will fully solve it, but it doesn't hurt to try: build your container image once and store it in Artifact Registry (or Container Registry), then reference that pre-built image in your Dataflow job creation request. This eliminates the on-the-fly container build, which is a significant contributor to cold starts.
gcloud builds submit --tag gcr.io/[PROJECT_ID]/[IMAGE_NAME]
Specify your pre-built image URI when creating your Dataflow job.
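For example, with a Flex Template you can bake the pre-built image URI into a template spec once and then launch jobs from that spec, so no image build happens at launch time. This is a minimal sketch; the bucket, template path, job name, and region are placeholders, and gcr.io/[PROJECT_ID]/[IMAGE_NAME] is assumed to be the image built above:
# One-time step: build the Flex Template spec against the pre-built image
gcloud dataflow flex-template build gs://[BUCKET]/templates/my-template.json \
    --image gcr.io/[PROJECT_ID]/[IMAGE_NAME] \
    --sdk-language PYTHON
# Launch a job from the spec; the container image is only pulled, not built
gcloud dataflow flex-template run my-job \
    --template-file-gcs-location gs://[BUCKET]/templates/my-template.json \
    --region us-central1
Because the spec is built once, later launches only read the JSON spec from Cloud Storage and pull the image, so the build cost is paid ahead of time rather than at job submission.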
Upvotes: 0