Gaps Sunnskyfsynn

Reputation: 1

Spark on AWS EKS java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found when running in cluster mode

I am trying to run a Spark job on an EKS cluster. When I run it in cluster mode I receive the following error:

 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
        at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
        at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
        at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$5(SparkSubmit.scala:393)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:393)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
        ... 27 more

I am submitting from an EC2 instance. The spark-submit command looks as follows:

spark-3.5.3-bin-hadoop3/bin/spark-submit \
--master k8s://https://aws.cluster:443 \
--deploy-mode cluster \
--name test1 \
--verbose \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.endpoint=s3.region.amazonaws.com \
--conf spark.hadoop.fs.s3a.access.key=access_key \
--conf spark.hadoop.fs.s3a.secret.key=secret_key \
--conf spark.kubernetes.container.image=ecr/spark-py:3.5.3 \
--conf spark.driver.extraClassPath="/opt/spark/jars/hadoop-aws-3.3.4.jar:/opt/spark/jars/aws-java-sdk-bundle-1.12.180.jar" \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--files 's3a://bucket/file1.jpg,s3a://bucket/file2.jpg' \
--py-files s3a://bucket/py-files.zip \
s3a://bucket/spark-application.py

When I submit the Spark job, the cluster spins up the driver and that is where I receive the error. I have made sure the appropriate jars are on the $SPARK_CLASSPATH and that they are the correct versions.
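For reference, the jars can be checked inside the image with something along these lines (the image name and path are taken from the submit command above; the grep pattern is only illustrative):

docker run --rm ecr/spark-py:3.5.3 ls /opt/spark/jars | grep -E 'hadoop-aws|aws-java-sdk'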

When I shell into the exact same container image that is spun up on the EKS node and run the exact same Spark job in client mode instead of cluster mode, I do not receive the error and the job runs successfully. Here is the spark-submit:

docker run -it ecr/spark-py:3.5.3 /bin/bash
/opt/spark/bin/spark-submit \
--deploy-mode client \
--name test1 \
--verbose \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.endpoint=s3.region.amazonaws.com \
--conf spark.hadoop.fs.s3a.access.key=access_key \
--conf spark.hadoop.fs.s3a.secret.key=secret_key \
--conf spark.kubernetes.container.image=ecr/spark-py:3.5.3 \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--files 's3a://bucket/file1.jpg,s3a://bucket/file2.jpg' \
--py-files s3a://bucket/py-files.zip \
s3a://bucket/spark-application.py

and here are the results:


24/12/16 18:56:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/12/16 18:56:54 INFO SparkContext: Added file s3a://bucket/file1.jpg at s3a://bucket/file1.jpg with timestamp 1734375412862
24/12/16 18:56:54 INFO Utils: Fetching s3a://bucket/file1.jpg to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp16712750277348807188.tmp
24/12/16 18:56:54 INFO SparkContext: Added file s3a://bucket/file2.jpg at s3a://bucket/file2.jpg with timestamp 1734375412862
24/12/16 18:56:54 INFO Utils: Fetching s3a://bucket/file2.jpg to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp9918504351183826242.tmp
24/12/16 18:56:55 INFO SparkContext: Added file s3a://bucket/pyfiles.zip at s3a://bucket/pyfiles.zip with timestamp 1734375412862
24/12/16 18:56:55 INFO Utils: Fetching s3a://bucket/pyfiles.zip to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp2697949075798797977.tmp
24/12/16 18:56:55 INFO Executor: Starting executor ID driver on host 1103cb8139a4
24/12/16 18:56:55 INFO Executor: OS info Linux, 6.8.0-1019-aws, amd64
24/12/16 18:56:55 INFO Executor: Java version 17.0.13
24/12/16 18:56:55 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
24/12/16 18:56:55 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@51434498 for default.
24/12/16 18:56:55 INFO Executor: Fetching s3a://bucket/pyfiles.zip with timestamp 1734375412862
24/12/16 18:56:55 INFO Utils: Fetching s3a://bucket/pyfiles.zip to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp412463927524377149.tmp
24/12/16 18:56:55 INFO Utils: /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp412463927524377149.tmp has been previously copied to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/test-spark_files_asn.zip
24/12/16 18:56:55 INFO Executor: Fetching s3a://bucket/file1.jpg with timestamp 1734375412862
24/12/16 18:56:55 INFO Utils: Fetching s3a://bucket/file1.jpg to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp4019774831140707035.tmp
24/12/16 18:56:55 INFO Utils: /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp4019774831140707035.tmp has been previously copied to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/GeoLite2-ASN.mmdb
24/12/16 18:56:55 INFO Executor: Fetching s3a://bucket/file2.jpg with timestamp 1734375412862
24/12/16 18:56:55 INFO Utils: Fetching s3a://bucket/file2.jpg to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp11601213107382369682.tmp
24/12/16 18:56:55 INFO Utils: /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e83b/userFiles-c9ca1323-4933-4cec-9066-c4437830d6dc/fetchFileTemp11601213107382369682.tmp has been previously copied to /tmp/spark-83e76e50-dab8-4f2e-9138-d58d1f69e8

I do not understand why it works in client mode on the container but does not work in cluster mode on the same container image.

Upvotes: 0

Views: 34

Answers (2)

Kashyap

Reputation: 17504

Provide the --packages option to spark-submit and it will download the required jars and make them available on the driver/executors, e.g.:

# all your other options stay the same; --conf spark.driver.extraClassPath is not needed
spark-submit \
--packages org.apache.hadoop:hadoop-aws:3.3.4 \
--py-files s3a://bucket/py-files.zip \
s3a://bucket/spark-application.py

OR try one of the other 100 ways to make the hadoop-aws jars available on the driver/executors.
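One of those alternatives, as a rough sketch: bake the jars into the container image itself so every driver and executor pod already has them under /opt/spark/jars, which is on Spark's default classpath. The Maven Central URLs and the non-root user id below are assumptions based on the jar versions mentioned in the question and the stock Spark images:

FROM ecr/spark-py:3.5.3
USER root
# jar versions taken from the question; ADD from a URL defaults to mode 600, hence the chmod
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.180/aws-java-sdk-bundle-1.12.180.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/hadoop-aws-3.3.4.jar /opt/spark/jars/aws-java-sdk-bundle-1.12.180.jar
# switch back to the non-root Spark user (185 in the stock Spark images; adjust for your base image)
USER 185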

Upvotes: 0

Ali BOUHLEL

Reputation: 600

In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

SOURCE
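A rough sketch of what that could look like for the submit command in the question, assuming the two jars are already present in the container image under /opt/spark/jars (the local:// scheme tells Spark the files already exist inside each pod; jars that only exist on the submitting machine would instead be uploaded via spark.kubernetes.file.upload.path). The s3a credential and endpoint --conf options from the question are omitted for brevity:

spark-3.5.3-bin-hadoop3/bin/spark-submit \
--master k8s://https://aws.cluster:443 \
--deploy-mode cluster \
--name test1 \
--jars local:///opt/spark/jars/hadoop-aws-3.3.4.jar,local:///opt/spark/jars/aws-java-sdk-bundle-1.12.180.jar \
--conf spark.kubernetes.container.image=ecr/spark-py:3.5.3 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--py-files s3a://bucket/py-files.zip \
s3a://bucket/spark-application.py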

Upvotes: 0
