Rodrigo Alarcón

Reputation: 97

Pyspark on K8s: Dependencies not found by workers


After following these instructions

the workers still fail to find the dependencies when they reach the first import in the script:

Unpacking an archive s3a://my-bucket/dependencies/spark-upload-5d9d9645-01fe-4979-8014-b9da1810d300/pyspark_venv.tar.gz#environment
 from /tmp/spark-aaadcabb-bf96-4531-82c1-858389224ff4/pyspark_venv.tar.gz
 to /opt/spark/./environment

File "/tmp/spark-1a1b36e7-64d9-4460-b346-ffb4afd67860/"

ModuleNotFoundError: No module named 'findspark'
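
One way to narrow this down is to log, at the very top of the script and before any third-party import, which interpreter is actually running it. The following is only a minimal sketch: `describe_python_env` is a hypothetical helper name, not part of Spark, and it assumes nothing beyond the standard library.

```python
import importlib.util
import os
import sys


def describe_python_env(module_name):
    """Collect facts about the interpreter that is running this code."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "pyspark_python": os.environ.get("PYSPARK_PYTHON"),
        "module_found": spec is not None,
        "module_origin": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # 'findspark' is the module named in the traceback above.
    for key, value in describe_python_env("findspark").items():
        print(f"{key}: {value}")
```

If `executable` is not the Python inside the virtual env, the `PYSPARK_PYTHON` setting probably did not reach the process that raised the error.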

In particular I included

RUN mkdir /opt/spark/conf
RUN chmod -R 777 /opt/spark/conf
RUN echo "export PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9" > /opt/spark/conf/

in the Spark image that the workers pull (the dependencies are installed in that virtual env). I did this because I had already tried virtually every suggestion that involved changing the spark-submit command to upload the dependencies to S3, and none of them worked. I don't know whether the problem is related to the code running inside a temp folder on the workers, or whether the base Spark image I pulled from Docker Hub (docker pull apache/spark:v3.3.2) is the wrong one.

Help would be very much appreciated

I'm trying to run a PySpark job on Kubernetes. I'm packaging the Python dependencies with a virtual env, as noted here:

I'm uploading those dependencies to S3 as described here:

However, I'm getting this error:

Unpacking an archive local://pyspark_venv.tar.gz#environment from /opt/spark to /opt/spark/./environment
Exception in thread "main" /opt/spark
at org.apache.spark.util.Utils$.unpack(Utils.scala:597)
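
The log says Spark is unpacking "from /opt/spark", a directory rather than the tarball, which would be consistent with the archive URI carrying no path component. The sketch below uses Python's `urllib.parse` purely to illustrate generic URI splitting; Spark does its own parsing on the JVM and may differ in detail. `split_archive_uri` is a hypothetical helper, and the absolute path in the second call is an assumed location, not one taken from the post.

```python
from urllib.parse import urlparse


def split_archive_uri(uri):
    """Return (scheme, authority, path, alias) for an --archives style URI."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path, parts.fragment


# Two slashes: the file name lands in the authority slot and the path is empty.
two_slashes = split_archive_uri("local://pyspark_venv.tar.gz#environment")
print(two_slashes)

# Three slashes plus an absolute path (assumed location): the path is populated.
three_slashes = split_archive_uri(
    "local:///opt/spark/pyspark_venv.tar.gz#environment"
)
print(three_slashes)
```

If the tarball is baked into the image, the three-slash form with its real absolute path may be worth trying; that has not been tested here.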

Here's the relevant piece of the Dockerfile:

COPY requirements.txt .
ENV PATH /pyspark_venv/bin:$PATH
ENV VIRTUAL_ENV /pyspark_venv

RUN python -m venv /pyspark_venv
RUN which python
RUN source pyspark_venv/bin/activate
RUN pip install -r requirements.txt
RUN venv-pack -o pyspark_venv.tar.gz

RUN mkdir /opt/cert

ENV PYSPARK_PYTHON=./environment/bin/python


COPY . .

RUN chmod +x ./docker/
ENTRYPOINT ["./docker/"]

and here the spark-submit command:

dep = 'pyspark_venv.tar.gz#environment'
cmd = f""" {SPARK_HOME}/bin/spark-submit
    --master {SPARK_MASTER}
    --deploy-mode cluster
    --name spark-policy-engine
    --executor-memory {EXECUTOR_MEMORY}
    --conf spark.executor.instances={N_EXECUTORS} 
    --conf spark.kubernetes.container.image={SPARK_IMAGE}
    --conf spark.kubernetes.authenticate.driver.serviceAccountName={SPARK_ROLE}
    --conf spark.kubernetes.namespace={NAMESPACE}
    --conf spark.kubernetes.authenticate.caCertFile=/opt/selfsigned_certificate.pem
    --conf spark.kubernetes.authenticate.submission.oauthToken={K8S_TOKEN}
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    --conf spark.hadoop.fs.s3a.access.key={S3_CONFIG['aws_access_key_id']}
    --conf spark.hadoop.fs.s3a.secret.key={S3_CONFIG['aws_secret_access_key']}
    --conf spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp
    --conf spark.kubernetes.file.upload.path=s3a://{S3_CONFIG['bucket']}/dependencies
    --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1
    --archives local://{dep} {spark_files} """
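
The command above is assembled as one multi-line f-string, and the post does not show how it is executed. If it is launched from Python, passing the arguments as a list avoids any dependence on how quoting and line breaks are handled. This is a sketch under that assumption: `build_submit_args` is a hypothetical helper, every value in the call is a placeholder, and only a few of the original options are reproduced.

```python
import shlex


def build_submit_args(spark_home, master, image, archive, app_file, conf=None):
    """Return spark-submit arguments as a list, one element per token."""
    args = [
        f"{spark_home}/bin/spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    # Extra settings become repeated "--conf key=value" pairs.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args += ["--archives", archive, app_file]
    return args


args = build_submit_args(
    spark_home="/opt/spark",                      # placeholder
    master="k8s://https://example.invalid:6443",  # placeholder
    image="example/spark-py:custom",              # placeholder
    archive="local:///opt/spark/pyspark_venv.tar.gz#environment",  # assumed path
    app_file="local:///opt/spark/work-dir/main.py",                # assumed path
    conf={"spark.executor.instances": "2"},
)

# Print a shell-quoted, single-line view of the command for logging.
print(shlex.join(args))
# To launch it: subprocess.run(args, check=True)
```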

The spark image (with custom dependencies) for the workers is:

FROM apache/spark-py:v3.3.2

USER root
RUN chmod -R 777 /opt/spark
RUN apt-get -y update; apt-get -y install curl software-properties-common
RUN apt install python3.9
RUN pip install --upgrade setuptools pip 
RUN apt install python3.9-venv

RUN mkdir -p /opt/spark/.ivy2/jars/

RUN curl --output /opt/spark/.ivy2/jars/org.eclipse.jetty_jetty-webapp-9.4.40.v20210413.jar
RUN curl --output /opt/spark/.ivy2/jars/com.github.stephenc.jcip_jcip-annotations-1.0-1.jar
RUN chmod -R 777 /opt/spark/.ivy2/jars/com.github.stephenc.jcip_jcip-annotations-1.0-1.jar

RUN python3.9 -m venv env
RUN which python3

RUN env/bin/pip3.9 install boto3==1.26.88
RUN env/bin/pip3.9 install botocore==1.29.88
RUN env/bin/pip3.9 install confluent-kafka==2.0.2
RUN env/bin/pip3.9 install distlib==0.3.6
RUN env/bin/pip3.9 install dnspython==2.3.0
RUN env/bin/pip3.9 install filelock==3.9.0
RUN env/bin/pip3.9 install findspark==2.0.1
RUN env/bin/pip3.9 install jmespath==1.0.1
RUN env/bin/pip3.9 install platformdirs==3.0.0
RUN env/bin/pip3.9 install py4j==
RUN env/bin/pip3.9 install pymongo==4.3.3
RUN env/bin/pip3.9 install pyspark==3.3.2
RUN env/bin/pip3.9 install python-dateutil==2.8.2
RUN env/bin/pip3.9 install python-decouple==3.7
RUN env/bin/pip3.9 install s3transfer==0.6.0
RUN env/bin/pip3.9 install six==1.16.0
RUN env/bin/pip3.9 install urllib3==1.26.14
RUN env/bin/pip3.9 install venv-pack==0.2.0

RUN mkdir /opt/spark/conf
RUN chmod -R 777 /opt/spark/conf
RUN echo "export PYSPARK_PYTHON=/opt/spark/work-dir/env/bin/python3.9" > /opt/spark/conf/

I tried many iterations with different configurations, but nothing seems to work. Maybe I'm mixing something up.

Upvotes: 0

Views: 605

Answers (1)

Sajad Safarveisi

Reputation: 101

Please try the following:

Make a zip file from the dependencies and push it to an S3 bucket so that spark-submit can pull it before the app starts.

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../ .
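
The same packing step can be sketched in Python. What matters is that the packages end up at the root of the archive, which is what running zip from inside the dependencies directory achieves. `zip_directory` is a hypothetical helper, and `demo_pkg` and `deps_demo.zip` are made-up names used only for the demo.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_directory(src_dir, zip_path):
    """Zip the contents of src_dir so its children sit at the archive root."""
    src = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir, not to the current directory.
                zf.write(path, path.relative_to(src).as_posix())
    return zip_path


# Self-contained demo: a throwaway package stands in for the real dependencies.
with tempfile.TemporaryDirectory() as tmp:
    deps = Path(tmp) / "dependencies"
    (deps / "demo_pkg").mkdir(parents=True)
    (deps / "demo_pkg" / "__init__.py").write_text("VALUE = 1\n")
    archive = zip_directory(deps, Path(tmp) / "deps_demo.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```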

Then sync the file with the S3 bucket

s3cmd -c <path-to-s3-config> sync s3://<bucket-name>/<prefix>

Finally, add the following to the spark-submit command

--py-files s3a://<bucket-name>/<prefix>/

Note the s3a here.
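
The spark-submit help text describes --py-files as a list of .zip, .egg, or .py files to place on the PYTHONPATH. The sketch below shows only the underlying Python mechanism, a zip archive on `sys.path` being importable. It does not involve Spark, and the module and file names are made up for the demo.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny archive holding one pure-Python module at its root.
    archive = Path(tmp) / "deps_demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo_zip_dep.py", "ANSWER = 42\n")

    # Putting the zip itself on sys.path makes its modules importable.
    sys.path.insert(0, str(archive))
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("demo_zip_dep")
        answer = module.ANSWER
    finally:
        sys.path.remove(str(archive))

print(answer)
```

Python's zip import mechanism only handles pure-Python code, so this route may not be enough for packages in requirements.txt that ship compiled extension modules.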

Upvotes: 0
