Alexey

Reputation: 1434

Running spark-submit from a machine not in a Hadoop cluster

I am trying to set up a Spark client distribution that our analysts can use from their desktops.

To achieve this, I added a "pre-built with user-provided Apache Hadoop" version of Spark to my existing Hadoop client distribution. I've tried this both on Windows (the clients are deployed in C:\HadoopClient) and on Linux (the clients are deployed in ~).
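For context, the user-provided-Hadoop build has to be pointed at the client's own Hadoop jars. The standard wiring from the Spark "Hadoop Free" build docs is a line like the following in conf/spark-env.sh, and my setup follows that pattern (on Windows the equivalent goes in spark-env.cmd):

# Let Spark pick up the local Hadoop client's jars at launch time
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Here hadoop classpath prints the full classpath of the local Hadoop client, which is what ends up in the SPARK_DIST_CLASSPATH contents mentioned below.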

I am trying to launch the most basic example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster Spark/examples/jars/spark-examples_2.12-3.1.1.jar

and it fails with:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

I've checked the usual suspects.

In particular, I looked at launch_container.sh on the Hadoop cluster and noticed that the CLASSPATH variable looks like this:

$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:/etc/hadoop/conf/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:

This is followed by the entire contents of my SPARK_DIST_CLASSPATH variable from the standalone machine (either C:\HadoopClient\Hadoop\share... or /home/user/Hadoop/share...).

org/apache/hadoop/fs/FSDataInputStream is from hadoop-common-3.1.2.jar, which should be on the CLASSPATH, as it's located in /usr/lib/hadoop on the cluster nodes. I actually looked inside just in case, and FSDataInputStream.class is there in the right place.
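For reference, this is roughly how I verified it (the jar path is the one on our cluster nodes; adjust as needed):

# List the jar's contents and look for the missing class
jar tf /usr/lib/hadoop/hadoop-common-3.1.2.jar | grep FSDataInputStream

which prints org/apache/hadoop/fs/FSDataInputStream.class among the matches.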

  1. Why can't Yarn find hadoop-common-3.1.2.jar if it's on the classpath?
  2. What is my SPARK_DIST_CLASSPATH doing in the classpath of the Yarn job? Is it harmless waste, or am I doing something wrong?

When I use a different distribution of Spark, the one that is pre-built for Apache Hadoop 3.1.2 (our Hadoop version), the job crashes even sooner, with:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

And the CLASSPATH variable looks like this:

$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:

Again, this is followed by the entire contents of my SPARK_DIST_CLASSPATH variable from the standalone machine.

In this case I can agree that there's no hadoop-common-3.1.2.jar on the classpath. The %SPARK_HOME%\jars directory on the standalone machine definitely contains hadoop-common-3.1.2.jar, but it's not present in $PWD/__spark_libs__/ on the cluster.

  1. Why isn't spark-submit sending the jar to the Hadoop cluster?
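If I understand the YARN submission path correctly, when neither spark.yarn.jars nor spark.yarn.archive is set, spark-submit zips everything under $SPARK_HOME/jars and ships it to the cluster as __spark_libs__; if one of them is set, it trusts that location instead and uploads nothing. A quick way to see which properties file and values are actually in effect is to rerun the same command with --verbose:

spark-submit --verbose --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster Spark/examples/jars/spark-examples_2.12-3.1.1.jar

The output starts with "Using properties file: ..." and echoes every property loaded from it, so a stray spark.yarn.archive or spark.yarn.jars setting shows up immediately.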

Upvotes: 1

Views: 485

Answers (1)

Alexey

Reputation: 1434

Turns out the problem was in the spark-defaults.conf that I had in SPARK_CONF_DIR. I had copied the whole conf directory from a cluster node in order to get access to Hive, but that spark-defaults.conf was tuned for running on a cluster node and overrode spark.yarn.archive. I removed the file from the configuration and was able to submit the job to the Yarn cluster successfully.
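For anyone hitting the same thing, the offending line was of this form (the HDFS path here is a made-up example; the real one only resolved sensibly from inside the cluster):

# SPARK_CONF_DIR/spark-defaults.conf, copied from the cluster node
spark.yarn.archive    hdfs:///apps/spark/spark-libs.zip

If I understand it correctly, with spark.yarn.archive set, spark-submit skips uploading the local %SPARK_HOME%\jars and YARN localizes the referenced archive as __spark_libs__ instead, so the containers never saw the jars from my client distribution. Removing the property (or, as in my case, the whole file) restores the default behavior of zipping and uploading the local jars.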

Upvotes: 1
