Reputation: 597
I am trying to deploy a PySpark application on K8s (with minikube) and followed the instructions from here: https://spark.apache.org/docs/2.4.6/running-on-kubernetes.html
I've built the images with the docker tools and pushed them to my registry as well. Then I invoke spark-submit like this:
./bin/spark-submit --master k8s://https://127.0.0.1:49154 --deploy-mode cluster --name pyspark-on-k8s --conf spark.executor.instances=1 --conf spark.kubernetes.driver.container.image=jsoft88/conda_spark:2.4.6 --conf spark.kubernetes.executor.container.image=jsoft88/conda_spark:2.4.6 --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.kubernetes.driverEnv.PYSPARK_DRIVER_PYTHON=/opt/miniconda3/envs/spark_env/bin/python --conf spark.kubernetes.driverEnv.PYSPARK_PYTHON=/opt/miniconda3/envs/spark_env/bin/python --conf spark.kubernetes.driverEnv.PYTHON_VERSION=3.7.3 /home/bitnami/spark-sample/app/main/sample_app.py --top 10
The *.driverEnv settings are just attempts I made, because by default the container does not use this Python version but Python 3.8.5 instead, which causes Spark to throw an error like this:
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_K8S_CMD=driver-py
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '--top 10' ']'
+ PYSPARK_ARGS='--top 10'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.8.5'
+ export PYTHON_VERSION=3.8.5
+ PYTHON_VERSION=3.8.5
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@" $PYSPARK_PRIMARY $PYSPARK_ARGS)
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner file:/home/bitnami/spark-sample/app/main/sample_app.py --top 10
21/03/04 12:55:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/home/bitnami/spark-sample/app/main/sample_app.py", line 4, in <module>
from pyspark.sql import DataFrame, SparkSession, functions
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 51, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 31, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 97, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 72, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 145, in <module>
File "/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
TypeError: an integer is required (got type bytes)
The idea is to have a conda environment inside the container, with the application installed in that environment, so I extended the Docker image generated by the docker-image-tool.sh script provided with the Spark binaries. My Dockerfile looks like this:
FROM jsoft88/spark-py:2.4.6
ENV PATH="/opt/miniconda3/bin:${PATH}"
ARG PATH="/opt/miniconda3/bin:${PATH}"
WORKDIR /home/bitnami
RUN apt update -y && apt install wget -y && wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
RUN chmod +x ./miniconda.sh
RUN ./miniconda.sh -b -f -p /opt/miniconda3
RUN rm -f miniconda.sh
RUN /opt/miniconda3/bin/conda init bash
COPY . /home/bitnami/spark-sample
RUN conda config --add channels conda-forge
RUN conda create --name spark_env --file /home/bitnami/spark-sample/requirements.txt --yes python=3.7.3
RUN . /opt/miniconda3/etc/profile.d/conda.sh && conda activate spark_env && cd /home/bitnami/spark-sample && pip install .
requirements.txt:
python==3.7.3
pyspark==2.4.6
pytest
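For reference, building and pushing the images described above boils down to something like this (the repo and tag are the ones used in the spark-submit command; the exact invocation may differ slightly):

# build and push the stock Spark images (produces jsoft88/spark-py:2.4.6, among others)
./bin/docker-image-tool.sh -r jsoft88 -t 2.4.6 build
./bin/docker-image-tool.sh -r jsoft88 -t 2.4.6 push

# build and push the conda image defined by the Dockerfile above
docker build -t jsoft88/conda_spark:2.4.6 .
docker push jsoft88/conda_spark:2.4.6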
Upvotes: 0
Views: 953
Reputation: 597
Well, it turns out that in Spark 2.4.6, virtual environments are not supported on K8s:
# TODO: Investigate running both pip and pip3 via virtualenvs
So I went ahead and introduced some hacks in the bindings, which are fully documented in my personal repo: https://github.com/jsoft88/pyspark-conda-k8s.
Basically, it was about modifying the entrypoint.sh shipped with the images built by Spark's docker-image-tool.sh and adding the lines required to activate the conda environment; a rough sketch of the idea is below.
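The gist of the change looks roughly like this (a sketch, not the literal diff from the repo; the conda path and the spark_env name are the ones from the Dockerfile in the question):

# Inside the python3 branch of entrypoint.sh: activate the conda env built into
# the image and point Spark at its interpreter instead of the system python3.
. /opt/miniconda3/etc/profile.d/conda.sh
conda activate spark_env
pyv3="$(python -V 2>&1)"
export PYTHON_VERSION="${pyv3:7}"
export PYSPARK_PYTHON="$(which python)"
export PYSPARK_DRIVER_PYTHON="$(which python)"

With that in place, the driver and executors pick up the conda interpreter (Python 3.7.3) instead of the system Python 3.8.5, and the cloudpickle TypeError from the question no longer occurs.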
Upvotes: 1