lili

Reputation: 25

Creating a Dockerfile to use Airflow and Spark: pip dependency backtracking makes the build extremely slow

I'm trying to build a Dockerfile to use Airflow and Spark, as follows:

FROM apache/airflow:2.7.0-python3.9

ENV AIRFLOW_HOME=/opt/airflow

USER root

# Update the package list, install required packages, and clean up
RUN apt-get update && \
    apt-get install -y gcc python3-dev openjdk-11-jdk wget && \
    apt-get clean

# Set the JAVA_HOME environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

COPY requirements.txt .

USER airflow
RUN pip install -U pip
RUN pip install --no-cache-dir -r requirements.txt

My requirements.txt is

apache-airflow
apache-airflow-providers-apache-spark
apache-airflow-providers-celery>=3.3.0
apache-airflow-providers-google
pandas
psycopg2-binary
pytest
pyspark
requests
sqlalchemy

It takes an extremely long time to build, and I keep getting INFO messages like the one below:

INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime.
 => => #   Downloading google_cloud_workflows-1.16.0-py2.py3-none-any.whl.metadata (5.2 kB)

If I remove python3.9 from the first line of my Dockerfile, then I'm unable to install openjdk-11-jdk.
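I suspect this is because the default tag is built on a newer Debian release, and Debian bookworm only packages openjdk-17-jdk, not openjdk-11-jdk (which Debian release the default tag actually uses is an assumption on my part). If that's the cause, the apt step would need something like this sketch instead:

# Assumes a Debian bookworm-based image, where OpenJDK 17 is the packaged JDK
RUN apt-get update && \
    apt-get install -y gcc python3-dev openjdk-17-jdk wget && \
    apt-get clean

# JAVA_HOME must match the installed JDK's path
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64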

Does anyone know how to solve this? Thank you.

Upvotes: 1

Views: 29

Answers (1)

Bhargav

Reputation: 4251

Try using Airflow's official constraints file. It contains pre-computed, known-compatible versions of Airflow and all of its dependencies, which drastically reduces the amount of dependency resolution pip has to do on its own.

FROM apache/airflow:2.7.0-python3.9

ENV AIRFLOW_HOME=/opt/airflow

USER root

# Update the package list, install required packages, and clean up
RUN apt-get update && \
    apt-get install -y gcc python3-dev openjdk-11-jdk wget && \
    apt-get clean

# Set the JAVA_HOME environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

COPY requirements.txt .

USER airflow
RUN pip install --upgrade pip
# Use pip's constraint mode to avoid backtracking
RUN pip install --no-cache-dir --use-pep517 --constraint=https://raw.githubusercontent.com/apache/airflow/constraints-2.7.0/constraints-3.9.txt -r requirements.txt
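To see which version the constraints file pins for any given package before building, you can inspect it directly from the host; a quick spot check (assuming curl is available) looks like this:

curl -s https://raw.githubusercontent.com/apache/airflow/constraints-2.7.0/constraints-3.9.txt \
  | grep apache-airflow-providers-google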

The requirements.txt:

apache-airflow
apache-airflow-providers-apache-spark
apache-airflow-providers-celery>=3.3.0
apache-airflow-providers-google==10.1.0
pandas
psycopg2-binary
pytest
pyspark
requests
sqlalchemy
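Pinning apache-airflow-providers-google here narrows the resolver's search space further; note that any pin you add must agree with the version in the constraints file, or pip will fail with a conflict instead of backtracking. With the constraints in place the build should resolve quickly; a typical build command (the image tag is just an example) is:

docker build -t airflow-spark .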

Upvotes: 0
