Alexey

Reputation: 1434

Running spark-submit from a machine not in a Hadoop cluster

I am trying to set up a Spark client distribution that our analysts can use from their desktops.

To achieve this, I added a "pre-built with user-provided Apache Hadoop" version of Spark to my existing Hadoop client distribution. I've tried this both on Windows (the clients are deployed in C:\HadoopClient) and on Linux (the clients are deployed in ~).
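For context, the user-provided-Hadoop build has to be pointed at the client's own Hadoop jars. The standard wiring from the Spark "Hadoop Free" build docs is a line like the following in conf/spark-env.sh, and my setup follows that pattern (on Windows the equivalent goes in spark-env.cmd):

# Let Spark pick up the local Hadoop client's jars at launch time
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Here hadoop classpath prints the full classpath of the local Hadoop client, which is what ends up in the SPARK_DIST_CLASSPATH contents mentioned below.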

I am trying to launch the most basic example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster Spark/examples/jars/spark-examples_2.12-3.1.1.jar

and it fails with:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

I've checked the usual suspects.

In particular, I looked at launch_container.sh on the Hadoop cluster and noticed that the CLASSPATH variable looks like this:

$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:/etc/hadoop/conf/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:

This is followed by the entire contents of my SPARK_DIST_CLASSPATH variable from the standalone machine (either C:\HadoopClient\Hadoop\share... or /home/user/Hadoop/share...).

org/apache/hadoop/fs/FSDataInputStream is from hadoop-common-3.1.2.jar, which should be on the CLASSPATH, as it's located in /usr/lib/hadoop on the cluster nodes. I actually looked inside just in case, and FSDataInputStream.class is there in the right place.
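For reference, this is roughly how I verified it (the jar path is the one on our cluster nodes; adjust as needed):

# List the jar's contents and look for the missing class
jar tf /usr/lib/hadoop/hadoop-common-3.1.2.jar | grep FSDataInputStream

which prints org/apache/hadoop/fs/FSDataInputStream.class among the matches.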

  1. Why can't Yarn find hadoop-common-3.1.2.jar if it's on the classpath?
  2. What is my SPARK_DIST_CLASSPATH doing in the classpath of the Yarn job? Is it harmless waste, or am I doing something wrong?

When I use a different distribution of Spark, the one that is pre-built for Apache Hadoop 3.1.2 (our Hadoop version), the job crashes even sooner, with:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

And the CLASSPATH variable looks like this:

$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:

Again, this is followed by the entire contents of my SPARK_DIST_CLASSPATH variable from the standalone machine.

In this case I can agree that there's no hadoop-common-3.1.2.jar on the classpath. The %SPARK_HOME%\jars directory on the standalone machine definitely contains hadoop-common-3.1.2.jar, but it's not present in $PWD/__spark_libs__/ on the cluster.

  1. Why isn't spark-submit sending the jar to the Hadoop cluster?
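If I understand the YARN submission path correctly, when neither spark.yarn.jars nor spark.yarn.archive is set, spark-submit zips everything under $SPARK_HOME/jars and ships it to the cluster as __spark_libs__; if one of them is set, it trusts that location instead and uploads nothing. A quick way to see which properties file and values are actually in effect is to rerun the same command with --verbose:

spark-submit --verbose --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster Spark/examples/jars/spark-examples_2.12-3.1.1.jar

The output starts with "Using properties file: ..." and echoes every property loaded from it, so a stray spark.yarn.archive or spark.yarn.jars setting shows up immediately.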

Upvotes: 1

Views: 485

Answers (1)

Alexey

Reputation: 1434

Turns out the problem was in the spark-defaults.conf that I had in SPARK_CONF_DIR. I had copied the whole conf directory from a cluster node in order to get access to Hive, but that spark-defaults.conf was tuned for running on a cluster node and overrode spark.yarn.archive. I removed the file from the configuration and was able to submit the job to the Yarn cluster successfully.
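For anyone hitting the same thing, the offending line was of this form (the HDFS path here is a made-up example; the real one only resolved sensibly from inside the cluster):

# SPARK_CONF_DIR/spark-defaults.conf, copied from the cluster node
spark.yarn.archive    hdfs:///apps/spark/spark-libs.zip

If I understand it correctly, with spark.yarn.archive set, spark-submit skips uploading the local %SPARK_HOME%\jars and YARN localizes the referenced archive as __spark_libs__ instead, so the containers never saw the jars from my client distribution. Removing the property (or, as in my case, the whole file) restores the default behavior of zipping and uploading the local jars.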

Upvotes: 1
