Reputation: 249
I'm trying to submit my PySpark application to a Kubernetes cluster (Minikube) using spark-submit:
./bin/spark-submit \
--master k8s://https://192.168.64.4:8443 \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 \
--conf spark.kubernetes.container.image='pyspark:dev' \
--conf spark.kubernetes.container.image.pullPolicy='Never' \
local:///main.py
The application tries to reach a Kafka instance deployed inside the cluster, so I specified the jar dependency:
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1
The container image I'm using is based on the one built with the utility script, and I packed all the Python dependencies my app needs inside it.
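For reference, I built the base image roughly like this (the repository/tag and the Minikube Docker-daemon setup are just how I happen to do it; my own Dockerfile with the Python dependencies is layered on top and tagged pyspark:dev):
# build the PySpark image with the utility script shipped in the Spark distribution,
# directly against Minikube's Docker daemon so pullPolicy=Never can find it
eval $(minikube docker-env)
./bin/docker-image-tool.sh \
  -r local \
  -t dev \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build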
The driver deploys correctly, fetches the Kafka package (I can provide the logs if needed), and launches the executor in a new pod.
But then the executor pod crashes:
ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaBatchInputPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:407)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
So I investigated the executor pod and found that the jar is indeed missing (as the stack trace suggests) from the directory on $SPARK_CLASSPATH (which is set to ':/opt/spark/jars/*').
Do I also need to fetch the dependency and bake it into the Spark jars folder when building the Docker image? (I thought the '--packages' option would make the executors retrieve the specified jar as well.)
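In other words, would the fix be something like the following in my image build? (The transitive jars and versions below are just my guess at what the connector needs at runtime; I haven't verified the exact list.)
# e.g. in a RUN step of the Dockerfile: download the connector and what I believe
# are its runtime dependencies into the directory on the executors' classpath
cd /opt/spark/jars
curl -fsSL -O https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.0.1/spark-sql-kafka-0-10_2.12-3.0.1.jar
curl -fsSL -O https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.0.1/spark-token-provider-kafka-0-10_2.12-3.0.1.jar
curl -fsSL -O https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.4.1/kafka-clients-2.4.1.jar
curl -fsSL -O https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar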
Upvotes: 7
Views: 2488
Reputation: 2383
Did you start out with the official Dockerfile (kubernetes/dockerfiles/spark/bindings/python/Dockerfile) as described in the Docker images section of the documentation? You also need to specify an upload location on a Hadoop-compatible filesystem and make sure that the specified Ivy home and cache directories have the correct permissions, as described in the Dependency Management section.
Example from the docs:
...
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
--conf spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path
--conf spark.hadoop.fs.s3a.access.key=...
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.hadoop.fs.s3a.secret.key=....
--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp"
...
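Applied to your command it would look roughly like this; the S3 bucket is a placeholder (any Hadoop-compatible filesystem reachable from the cluster works), you still need the s3a connector/credentials settings from the snippet above, and the Ivy options are only required if the default Ivy directories aren't writable in your image:
# sketch: same submit command, plus an upload path and writable Ivy dirs
./bin/spark-submit \
  --master k8s://https://192.168.64.4:8443 \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 \
  --conf spark.kubernetes.file.upload.path=s3a://<your-bucket>/spark-uploads \
  --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
  --conf spark.kubernetes.container.image='pyspark:dev' \
  --conf spark.kubernetes.container.image.pullPolicy='Never' \
  local:///main.py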
Upvotes: 2