Reputation: 1434
I am trying to set up a Spark client distribution that our analysts can use from their desktops.
To achieve this, I added a "pre-built with user-provided Apache Hadoop" version of Spark to my existing Hadoop client distribution. I've tried this both on Windows (the clients are deployed in C:\HadoopClient) and on Linux (the clients are deployed in ~).
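Since a "user-provided Hadoop" build ships without any Hadoop jars, the client is wired to the Hadoop distribution through SPARK_DIST_CLASSPATH, the way the Spark documentation's "Hadoop Free Build" page describes. A minimal sketch of my conf/spark-env.sh on the Linux client (paths are illustrative, not my exact layout; the Windows side uses spark-env.cmd with equivalent set statements):

#!/usr/bin/env bash
# conf/spark-env.sh on the client machine (example paths, adjust to your layout)
export HADOOP_HOME=~/Hadoop
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
# Point the Hadoop-free Spark build at the client's Hadoop jars,
# as recommended for "user-provided Hadoop" builds in the Spark docs.
export SPARK_DIST_CLASSPATH="$("$HADOOP_HOME/bin/hadoop" classpath)"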
I am trying to launch the most basic example:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster Spark/examples/jars/spark-examples_2.12-3.1.1.jar
and it fails with:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
I've checked the usual suspects, starting with the output of hadoop classpath on the client (see the sanity check below).
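For example, this is roughly how I verified that hadoop-common is visible on the client (--glob expands the wildcard entries so grep can see individual jars):

hadoop classpath --glob | tr ':' '\n' | grep hadoop-common
# expected to print the hadoop-common jar from the client distribution,
# e.g. somewhere under Hadoop/share/hadoop/common/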
I've also checked launch_container.sh on the Hadoop cluster, and noticed that the CLASSPATH variable looks like this:
$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:/etc/hadoop/conf/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:
followed by the entire contents of my SPARK_DIST_CLASSPATH variable from the standalone machine (either C:\HadoopClient\Hadoop\share... or /home/user/Hadoop/share...).
org/apache/hadoop/fs/FSDataInputStream comes from hadoop-common-3.1.2.jar, which should be on the CLASSPATH, as it's located in /usr/lib/hadoop on the cluster nodes. I actually looked inside the jar just in case, and FSDataInputStream.class is there in the right place.
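The check was along these lines, run on a cluster node (unzip -l works too if the JDK's jar tool isn't installed):

# list the jar's entries and look for the class in question
jar tf /usr/lib/hadoop/hadoop-common-3.1.2.jar | grep FSDataInputStream
# prints org/apache/hadoop/fs/FSDataInputStream.class among other entries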
So why can't the class be loaded from hadoop-common-3.1.2.jar if it's on the classpath?
When I use a different distribution of Spark, the one pre-built for Apache Hadoop 3.1.2 (our Hadoop version), the job crashes even quicker, with:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
And the CLASSPATH variable looks like this:
$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:
followed by the entire contents of my SPARK_DIST_CLASSPATH variable from the standalone machine.
In this case I can agree that there's no hadoop-common-3.1.2.jar on the classpath. The %SPARK_HOME%\jars directory on the standalone machine definitely contains hadoop-common-3.1.2.jar, but it's not present in $PWD/__spark_libs__/ on the cluster.
Upvotes: 1
Views: 485
Reputation: 1434
Turns out the problem was the spark-defaults.conf I had in SPARK_CONF_DIR. I had copied the whole conf directory from a cluster node in order to access Hive, but that spark-defaults.conf was tuned to run from the cluster node and overrode spark.yarn.archive. I removed the file from the configuration and was able to submit the job to the YARN cluster successfully.
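For anyone hitting the same symptom: the offending entry was along these lines (the actual path is whatever your cluster uses; this one is made up):

# in the spark-defaults.conf copied from the cluster node (hypothetical path)
spark.yarn.archive hdfs:///apps/spark/spark-libs.zip

When spark.yarn.archive is set, Spark distributes that archive to the YARN cache and populates $PWD/__spark_libs__/ from it instead of uploading the jars from the client's $SPARK_HOME/jars, which matches the missing hadoop-common-3.1.2.jar in __spark_libs__ observed above. Running spark-submit --verbose is a quick way to see which value of the property actually wins.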
Upvotes: 1