Rodrigo Alarcón

Reputation: 97

Pyspark on K8s: Dependencies not found by workers

UPDATE:

After following the instructions in https://stackoverflow.com/a/49450625/14571370, the workers still fail to find the dependencies when they hit the first import in the script:

Unpacking an archive s3a://my-bucket/dependencies/spark-upload-5d9d9645-01fe-4979-8014-b9da1810d300/pyspark_venv.tar.gz#environment from /tmp/spark-aaadcabb-bf96-4531-82c1-858389224ff4/pyspark_venv.tar.gz to /opt/spark/./environment

File "/tmp/spark-1a1b36e7-64d9-4460-b346-ffb4afd67860/policy_processor.py"

ModuleNotFoundError: No module named 'findspark'

In particular, I included

RUN mkdir /opt/spark/conf
RUN chmod -R 777 /opt/spark/conf
RUN echo "export PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9" > /opt/spark/conf/spark-env.sh

in the Spark image that the workers pull (the dependencies are installed in the virtual env). I did this because I had tried virtually all the suggestions that involve modifying the spark-submit command to load the dependencies to S3, and none of them worked. I don't know whether it has something to do with the code running inside the temp folder on the workers, or whether the base Spark image I pulled from Docker Hub (docker pull apache/spark:v3.3.2) is not the right one.
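
For reference, my understanding is that the same executor Python could also be set from the spark-submit side with the documented spark.pyspark.python property, pointing at the same venv path that the spark-env.sh line uses (the /opt/spark/work-dir/env path is the venv created in the worker Dockerfile below):

--conf spark.pyspark.python=/opt/spark/work-dir/env/bin/python3.9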

Any help would be very much appreciated.


I'm trying to run a PySpark job on Kubernetes. I'm packaging the Python dependencies with a virtual environment, as described here: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html

I'm uploading those dependencies to S3 as described here: https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
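
For reference, the pattern described in those two links boils down to packing the virtual environment with venv-pack and shipping it via --archives, roughly like this (app.py stands in for the actual application script):

python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install -r requirements.txt
venv-pack -o pyspark_venv.tar.gz

export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py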

However, I'm getting this error:

Unpacking an archive local://pyspark_venv.tar.gz#environment from /opt/spark to /opt/spark/./environment
Exception in thread "main" java.io.FileNotFoundException: /opt/spark
at org.apache.spark.util.Utils$.unpack(Utils.scala:597)

Here's the relevant piece of the Dockerfile:

# INSTALL DEPENDENCIES
COPY requirements.txt .
ENV PATH /pyspark_venv/bin:$PATH  
ENV VIRTUAL_ENV /pyspark_venv                                  

RUN python -m venv /pyspark_venv
RUN which python
RUN source pyspark_venv/bin/activate
RUN pip install -r requirements.txt
RUN venv-pack -o pyspark_venv.tar.gz

RUN mkdir /opt/cert

# SET SPARK ENV VARIABLES
ENV PYSPARK_PYTHON=./environment/bin/python
ENV PATH="${SPARK_HOME}/bin/:${PATH}"

# SET PYSPARK VARIABLES
ENV PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"

# COPY APP FILES:
COPY . .

RUN chmod +x ./docker/entrypoint.sh
ENTRYPOINT ["./docker/entrypoint.sh"]

and here is the spark-submit command:

dep = 'pyspark_venv.tar.gz#environment'
cmd = f""" {SPARK_HOME}/bin/spark-submit
    --master {SPARK_MASTER}
    --deploy-mode cluster
    --name spark-policy-engine
    --executor-memory {EXECUTOR_MEMORY}
    --conf spark.executor.instances={N_EXECUTORS} 
    --conf spark.kubernetes.container.image={SPARK_IMAGE}
    --conf spark.kubernetes.authenticate.driver.serviceAccountName={SPARK_ROLE}
    --conf spark.kubernetes.namespace={NAMESPACE}
    --conf spark.kubernetes.authenticate.caCertFile=/opt/selfsigned_certificate.pem
    --conf spark.kubernetes.authenticate.submission.oauthToken={K8S_TOKEN}
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    --conf spark.hadoop.fs.s3a.access.key={S3_CONFIG['aws_access_key_id']}
    --conf spark.hadoop.fs.s3a.secret.key={S3_CONFIG['aws_secret_access_key']}
    --conf spark.hadoop.fs.s3a.fast.upload=true
    --conf spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp
    --conf spark.kubernetes.file.upload.path=s3a://{S3_CONFIG['bucket']}/dependencies
    --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1
    --archives local://{dep} {spark_files} """

The Spark image (with custom dependencies) for the workers is built as follows:

 FROM apache/spark-py:v3.3.2

USER root
RUN chmod -R 777 /opt/spark
RUN apt-get -y update; apt-get -y install curl software-properties-common
RUN apt install python3.9
RUN pip install --upgrade setuptools pip 
RUN apt install python3.9-venv

RUN mkdir -p /opt/spark/.ivy2/jars/

RUN curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-webapp/9.4.40.v20210413/jetty-webapp-9.4.40.v20210413.jar --output /opt/spark/.ivy2/jars/org.eclipse.jetty_jetty-webapp-9.4.40.v20210413.jar
RUN curl https://repo1.maven.org/maven2/com/github/stephenc/jcip/jcip-annotations/1.0-1/jcip-annotations-1.0-1.jar --output /opt/spark/.ivy2/jars/com.github.stephenc.jcip_jcip-annotations-1.0-1.jar
RUN chmod -R 777 /opt/spark/.ivy2/jars/com.github.stephenc.jcip_jcip-annotations-1.0-1.jar

RUN python3.9 -m venv env
RUN which python3

RUN env/bin/pip3.9 install boto3==1.26.88
RUN env/bin/pip3.9 install botocore==1.29.88
RUN env/bin/pip3.9 install confluent-kafka==2.0.2
RUN env/bin/pip3.9 install distlib==0.3.6
RUN env/bin/pip3.9 install dnspython==2.3.0
RUN env/bin/pip3.9 install filelock==3.9.0
RUN env/bin/pip3.9 install findspark==2.0.1
RUN env/bin/pip3.9 install jmespath==1.0.1
RUN env/bin/pip3.9 install platformdirs==3.0.0
RUN env/bin/pip3.9 install py4j==0.10.9.5
RUN env/bin/pip3.9 install pymongo==4.3.3
RUN env/bin/pip3.9 install pyspark==3.3.2
RUN env/bin/pip3.9 install python-dateutil==2.8.2
RUN env/bin/pip3.9 install python-decouple==3.7
RUN env/bin/pip3.9 install s3transfer==0.6.0
RUN env/bin/pip3.9 install six==1.16.0
RUN env/bin/pip3.9 install urllib3==1.26.14
RUN env/bin/pip3.9 install venv-pack==0.2.0

RUN mkdir /opt/spark/conf
RUN chmod -R 777 /opt/spark/conf
RUN echo "export PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9" > /opt/spark/conf/spark-env.sh

I've tried many iterations with different configurations, but nothing seems to work; maybe I'm mixing something up.

Upvotes: 0

Views: 605

Answers (1)

Sajad Safarveisi

Reputation: 101

Please try the following:

Make a zip file from the dependencies and push it to an S3 bucket so that it can be pulled by spark-submit before the app starts:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

Then sync the file to the S3 bucket:

s3cmd -c <path-to-s3-config> sync dependencies.zip s3://<bucket-name>/<prefix>

Finally, add the following to the spark-submit command:

--py-files s3a://<bucket-name>/<prefix>/dependencies.zip

Note the s3a scheme here.
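
For illustration, this is roughly how it slots into the spark-submit command from the question (bucket name and prefix are placeholders, and policy_processor.py stands in for the application script):

{SPARK_HOME}/bin/spark-submit
    --master {SPARK_MASTER}
    --deploy-mode cluster
    ...
    --py-files s3a://<bucket-name>/<prefix>/dependencies.zip
    policy_processor.py

Since --py-files only puts the zip on the executors' PYTHONPATH, this should work for pure-Python packages such as findspark; packages with compiled extensions generally still need the archive/virtual-env approach.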

Upvotes: 0
