Reputation: 597
I am trying to deploy a PySpark application on K8s (with minikube) and followed the instructions from here: https://spark.apache.org/docs/2.4.6/running-on-kubernetes.html
I've built the images with the docker tools and pushed them to my registry as well. Then I invoke spark-submit like this:
./bin/spark-submit --master k8s://https://127.0.0.1:49154 --deploy-mode cluster --name pyspark-on-k8s --conf spark.executor.instances=1 --conf spark.kubernetes.driver.container.image=jsoft88/conda_spark:2.4.6 --conf spark.kubernetes.executor.container.image=jsoft88/conda_spark:2.4.6 --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.kubernetes.driverEnv.PYSPARK_DRIVER_PYTHON=/opt/miniconda3/envs/spark_env/bin/python --conf spark.kubernetes.driverEnv.PYSPARK_PYTHON=/opt/miniconda3/envs/spark_env/bin/python --conf spark.kubernetes.driverEnv.PYTHON_VERSION=3.7.3 /home/bitnami/spark-sample/app/main/sample_app.py --top 10
The *.driverEnv settings are just attempts I made, because by default the container does not use this Python version but Python 3.8.5 instead, which causes Spark to throw an error like this:
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_K8S_CMD=driver-py
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '--top 10' ']'
+ PYSPARK_ARGS='--top 10'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.8.5'
+ export PYTHON_VERSION=3.8.5
+ PYTHON_VERSION=3.8.5
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@" $PYSPARK_PRIMARY $PYSPARK_ARGS)
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner file:/home/bitnami/spark-sample/app/main/sample_app.py --top 10
21/03/04 12:55:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/home/bitnami/spark-sample/app/main/sample_app.py", line 4, in <module>
from pyspark.sql import DataFrame, SparkSession, functions
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 51, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 31, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 97, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 72, in <module>
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 145, in <module>
File "/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
TypeError: an integer is required (got type bytes)
The idea is to have a conda environment inside the container, with the application installed in that environment, so I extended the Docker image generated by the docker-image-tool.sh script provided with the Spark binaries. My Dockerfile looks like this:
FROM jsoft88/spark-py:2.4.6
ENV PATH="/opt/miniconda3/bin:${PATH}"
ARG PATH="/opt/miniconda3/bin:${PATH}"
WORKDIR /home/bitnami
RUN apt update -y && apt install wget -y && wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
RUN chmod +x ./miniconda.sh
RUN ./miniconda.sh -b -f -p /opt/miniconda3
RUN rm -f miniconda.sh
RUN /opt/miniconda3/bin/conda init bash
COPY . /home/bitnami/spark-sample
RUN conda config --add channels conda-forge
RUN conda create --name spark_env --file /home/bitnami/spark-sample/requirements.txt --yes python=3.7.3
RUN . /opt/miniconda3/etc/profile.d/conda.sh && conda activate spark_env && cd /home/bitnami/spark-sample && pip install .
requirements.txt:
python==3.7.3
pyspark==2.4.6
pytest
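For reference, building and pushing the images described above boils down to something like this (the repo and tag are the ones used in the spark-submit command; the exact invocation may differ slightly):

# build and push the stock Spark images (produces jsoft88/spark-py:2.4.6, among others)
./bin/docker-image-tool.sh -r jsoft88 -t 2.4.6 build
./bin/docker-image-tool.sh -r jsoft88 -t 2.4.6 push

# build and push the conda image defined by the Dockerfile above
docker build -t jsoft88/conda_spark:2.4.6 .
docker push jsoft88/conda_spark:2.4.6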
Upvotes: 0
Views: 953
Reputation: 597
Well, it turns out that in Spark 2.4.6, virtual environments are not supported on K8s:
# TODO: Investigate running both pip and pip3 via virtualenvs
So I went ahead and introduced some hacks in the bindings, which are fully documented in my personal repo: https://github.com/jsoft88/pyspark-conda-k8s.
Basically, it was about modifying the entrypoint.sh shipped with the images built by Spark's docker-image-tool.sh and adding the lines required to activate the conda environment; a rough sketch of the idea is below.
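The gist of the change looks roughly like this (a sketch, not the literal diff from the repo; the conda path and the spark_env name are the ones from the Dockerfile in the question):

# Inside the python3 branch of entrypoint.sh: activate the conda env built into
# the image and point Spark at its interpreter instead of the system python3.
. /opt/miniconda3/etc/profile.d/conda.sh
conda activate spark_env
pyv3="$(python -V 2>&1)"
export PYTHON_VERSION="${pyv3:7}"
export PYSPARK_PYTHON="$(which python)"
export PYSPARK_DRIVER_PYTHON="$(which python)"

With that in place, the driver and executors pick up the conda interpreter (Python 3.7.3) instead of the system Python 3.8.5, and the cloudpickle TypeError from the question no longer occurs.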
Upvotes: 1