Reputation: 97
UPDATE:
After following the instructions in https://stackoverflow.com/a/49450625/14571370, the workers still fail to find the dependencies, crashing on the first import in the script:
Unpacking an archive s3a://my-bucket/dependencies/spark-upload-5d9d9645-01fe-4979-8014-b9da1810d300/pyspark_venv.tar.gz#environment from /tmp/spark-aaadcabb-bf96-4531-82c1-858389224ff4/pyspark_venv.tar.gz to /opt/spark/./environment
File "/tmp/spark-1a1b36e7-64d9-4460-b346-ffb4afd67860/policy_processor.py"
ModuleNotFoundError: No module named 'findspark'
In particular, I added
RUN mkdir /opt/spark/conf
RUN chmod -R 777 /opt/spark/conf
RUN echo "export PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9" > /opt/spark/conf/spark-env.sh
in the Spark image that the workers pull (the dependencies are installed in the virtual env). I did this because I had already tried virtually all the suggestions that involve modifying the spark-submit command to load the dependencies to S3, and none of them worked. I don't know whether it has something to do with the code running inside the temp folder on the workers, or whether the base Spark image that I pulled from Docker Hub (docker pull apache/spark:v3.3.2) is not the right one.
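For reference, the spark-submit equivalent of this spark-env.sh setting would apparently be something like the following (untested on my side; the interpreter path is the one my worker image uses):
# Untested alternative to baking spark-env.sh into the image: set the
# interpreter for driver and executors via spark-submit conf instead.
--conf spark.kubernetes.driverEnv.PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9 \
--conf spark.executorEnv.PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9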
Help would be very much appreciated
I'm trying to run a PySpark job on Kubernetes. I'm packaging the Python dependencies with virtualenv as described here: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
I'm uploading those dependencies to S3 as described here: https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
However, I'm getting this error:
Unpacking an archive local://pyspark_venv.tar.gz#environment from /opt/spark to /opt/spark/./environment
Exception in thread "main" java.io.FileNotFoundException: /opt/spark
at org.apache.spark.util.Utils$.unpack(Utils.scala:597)
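From the Spark docs, my understanding is that a local:// URI must be an absolute path to a file already baked into the image, so the flag probably needs three slashes; a sketch (the archive location inside the image is an assumption on my part):
# Sketch: local:// requires an absolute in-image path;
# /app/pyspark_venv.tar.gz is an assumed location, not verified.
--archives local:///app/pyspark_venv.tar.gz#environment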
Here's the relevant piece of the Dockerfile:
# INSTALL DEPENDENCIES
COPY requirements.txt .
ENV PATH /pyspark_venv/bin:$PATH
ENV VIRTUAL_ENV /pyspark_venv
RUN python -m venv /pyspark_venv
RUN which python
RUN source pyspark_venv/bin/activate
RUN pip install -r requirements.txt
RUN venv-pack -o pyspark_venv.tar.gz
RUN mkdir /opt/cert
# SET SPARK ENV VARIABLES
ENV PYSPARK_PYTHON=./environment/bin/python
ENV PATH="${SPARK_HOME}/bin/:${PATH}"
# SET PYSPARK VARIABLES
ENV PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
# COPY APP FILES:
COPY . .
RUN chmod +x ./docker/entrypoint.sh
ENTRYPOINT ["./docker/entrypoint.sh"]
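Note that the source line above does not persist into later layers (each RUN is a fresh shell); pip still resolves to the venv only because of the PATH line. A minimal sketch of the shell steps the build effectively needs, calling the venv's own tools directly (same paths as above):
# Sketch: build and pack the venv without `source`, since activation
# does not carry over between Docker RUN layers.
python -m venv /pyspark_venv
/pyspark_venv/bin/pip install -r requirements.txt
/pyspark_venv/bin/pip install venv-pack
/pyspark_venv/bin/venv-pack -o pyspark_venv.tar.gz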
and here is the spark-submit command:
dep = 'pyspark_venv.tar.gz#environment'
cmd = f""" {SPARK_HOME}/bin/spark-submit
--master {SPARK_MASTER}
--deploy-mode cluster
--name spark-policy-engine
--executor-memory {EXECUTOR_MEMORY}
--conf spark.executor.instances={N_EXECUTORS}
--conf spark.kubernetes.container.image={SPARK_IMAGE}
--conf spark.kubernetes.authenticate.driver.serviceAccountName={SPARK_ROLE}
--conf spark.kubernetes.namespace={NAMESPACE}
--conf spark.kubernetes.authenticate.caCertFile=/opt/selfsigned_certificate.pem
--conf spark.kubernetes.authenticate.submission.oauthToken={K8S_TOKEN}
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
--conf spark.hadoop.fs.s3a.access.key={S3_CONFIG['aws_access_key_id']}
--conf spark.hadoop.fs.s3a.secret.key={S3_CONFIG['aws_secret_access_key']}
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp
--conf spark.kubernetes.file.upload.path=s3a://{S3_CONFIG['bucket']}/dependencies
--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1
--archives local://{dep} {spark_files} """
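Since spark.kubernetes.file.upload.path is set, my reading of the docs is that a client-local path can also be passed, letting spark-submit upload the archive to S3 itself; a sketch (the client-side path is a placeholder):
# Sketch: a file:// (client-side) path makes spark-submit upload the archive
# to spark.kubernetes.file.upload.path; /app/pyspark_venv.tar.gz is assumed.
--archives file:///app/pyspark_venv.tar.gz#environment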
The Spark image (with custom dependencies) for the workers is:
FROM apache/spark-py:v3.3.2
USER root
RUN chmod -R 777 /opt/spark
RUN apt-get -y update; apt-get -y install curl software-properties-common
RUN apt install python3.9
RUN pip install --upgrade setuptools pip
RUN apt install python3.9-venv
RUN mkdir -p /opt/spark/.ivy2/jars/
RUN curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-webapp/9.4.40.v20210413/jetty-webapp-9.4.40.v20210413.jar --output /opt/spark/.ivy2/jars/org.eclipse.jetty_jetty-webapp-9.4.40.v20210413.jar
RUN curl https://repo1.maven.org/maven2/com/github/stephenc/jcip/jcip-annotations/1.0-1/jcip-annotations-1.0-1.jar --output /opt/spark/.ivy2/jars/com.github.stephenc.jcip_jcip-annotations-1.0-1.jar
RUN chmod -R 777 /opt/spark/.ivy2/jars/com.github.stephenc.jcip_jcip-annotations-1.0-1.jar
RUN python3.9 -m venv env
RUN which python3
RUN env/bin/pip3.9 install boto3==1.26.88
RUN env/bin/pip3.9 install botocore==1.29.88
RUN env/bin/pip3.9 install confluent-kafka==2.0.2
RUN env/bin/pip3.9 install distlib==0.3.6
RUN env/bin/pip3.9 install dnspython==2.3.0
RUN env/bin/pip3.9 install filelock==3.9.0
RUN env/bin/pip3.9 install findspark==2.0.1
RUN env/bin/pip3.9 install jmespath==1.0.1
RUN env/bin/pip3.9 install platformdirs==3.0.0
RUN env/bin/pip3.9 install py4j==0.10.9.5
RUN env/bin/pip3.9 install pymongo==4.3.3
RUN env/bin/pip3.9 install pyspark==3.3.2
RUN env/bin/pip3.9 install python-dateutil==2.8.2
RUN env/bin/pip3.9 install python-decouple==3.7
RUN env/bin/pip3.9 install s3transfer==0.6.0
RUN env/bin/pip3.9 install six==1.16.0
RUN env/bin/pip3.9 install urllib3==1.26.14
RUN env/bin/pip3.9 install venv-pack==0.2.0
RUN mkdir /opt/spark/conf
RUN chmod -R 777 /opt/spark/conf
RUN echo "export PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9" > /opt/spark/conf/spark-env.sh
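To check whether the image itself is the problem, one sanity check would be to confirm that the interpreter spark-env.sh points to exists and can import findspark (the image tag is a placeholder):
# Sanity check: run the venv interpreter referenced by spark-env.sh and import
# findspark; <my-spark-image> is a placeholder for the image built above.
docker run --rm --entrypoint /opt/spark/work-dir/env/bin/python3.9 <my-spark-image> \
  -c 'import findspark; print("findspark OK")'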
I've tried many iterations with different configs, but nothing seems to work; maybe I'm mixing something up.
Upvotes: 0
Views: 605
Reputation: 101
Please try the following:
Make a zip file from the dependencies and push it to an S3 bucket so that it can be pulled by spark-submit before the app starts:
pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .
Then sync the file to the S3 bucket:
s3cmd -c <path-to-s3-config> sync dependencies.zip s3://<bucket-name>/<prefix>
Finally, add the following to the spark-submit command:
--py-files s3a://<bucket-name>/<prefix>/dependencies.zip
Note the s3a here.
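For context, a minimal sketch of how this fits into a spark-submit invocation (bucket, prefix, master, and the application file location are all placeholders):
# Sketch: pull the zipped dependencies from S3 at startup; every <...> value is
# a placeholder, and the s3a credentials/jars must be configured as in the question.
$SPARK_HOME/bin/spark-submit \
  --master k8s://<k8s-api-server> \
  --deploy-mode cluster \
  --py-files s3a://<bucket-name>/<prefix>/dependencies.zip \
  <path-to-main-script>.py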
Upvotes: 0