Cheloute

Reputation: 893

Different behaviour running Scala in Spark on Yarn mode between SBT and spark-submit

I'm currently trying to run some Scala code with Apache Spark in yarn-client mode against a Cloudera cluster, but the sbt run execution is aborted by the following Java exception:

[error] (run-main-0) org.apache.spark.SparkException: YARN mode not available ?
org.apache.spark.SparkException: YARN mode not available ?
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1267)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:199)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
        at SimpleApp$.main(SimpleApp.scala:7)
        at SimpleApp.main(SimpleApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:191)
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1261)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:199)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
        at SimpleApp$.main(SimpleApp.scala:7)
        at SimpleApp.main(SimpleApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

[trace] Stack trace suppressed: run last compile:run for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
        at scala.sys.package$.error(package.scala:27)

[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
15/11/24 17:18:03 INFO network.ConnectionManager: Selector thread was interrupted!
[error] Total time: 38 s, completed 24-nov-2015 17:18:04

I suppose the prebuilt Apache Spark distribution is built with YARN support, because if I run the same program through spark-submit in yarn-client mode there is no Java exception anymore, but YARN does not seem to allocate any resources: I get the same message every second, INFO Client: Application report for application_1448366262851_0022 (state: ACCEPTED). I suspect a configuration issue.

I googled this last message, but I can't figure out which YARN configuration I have to modify (or where) to run my program with Spark on YARN.

Context:

Scala Test Program:

UPDATE

Well, the SBT job failed because hadoop-client.jar and spark-yarn.jar were not on the classpath when the application was packaged and run by SBT.

Now, sbt run asks for the environment variables SPARK_YARN_APP_JAR and SPARK_JAR, with my build.sbt configured like this:

name := "File Searcher"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "0.9.1" % "runtime"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "runtime"
libraryDependencies += "org.apache.hadoop" % "hadoop-yarn-client" % "2.6.0" % "runtime"
resolvers += "Maven Central" at "https://repo1.maven.org/maven2"

Is there any way to configure these variables "automatically"? I mean, I can set SPARK_JAR, since that jar comes with the Spark installation, but what about SPARK_YARN_APP_JAR? Also, when I set those variables manually, I notice that Spark doesn't take my custom configuration into account, even if I set the YARN_CONF_DIR variable. Is there a way to tell SBT to use my local Spark configuration?
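
One way that might work to set these variables from SBT itself (a sketch, assuming sbt 0.13+ with a forked run; the jar paths are the ones used in the code below, and the YARN_CONF_DIR path is purely a hypothetical example):

// build.sbt -- sketch: run in a forked JVM so the environment variables
// Spark's YARN client looks for can be injected by SBT.
fork in run := true

envVars in run := Map(
  "SPARK_JAR"          -> "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar", // Spark assembly jar
  "SPARK_YARN_APP_JAR" -> "target/scala-2.10/file-searcher_2.10-1.0.jar",      // packaged application jar
  "YARN_CONF_DIR"      -> "C:/hadoop/conf"                                      // hypothetical path to the cluster config copies
)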

In case it helps, here is the current (ugly) code I'm executing:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt"
    val sc = new SparkContext("yarn-client", "Simple App", "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar",
      List("target/scala-2.10/file-searcher_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numTHEs = logData.filter(line => line.contains("the")).count()
    println("Lines with the: %s".format(numTHEs))
  }
}
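
For comparison, a minimal sketch of the same program built through a SparkConf instead of the positional SparkContext constructor (assuming a Spark 1.x client; the jar locations are the same ones used above, and spark.yarn.jar is the Spark 1.x property that points YARN at the Spark assembly):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Sketch only: same placeholder paths as above; adjust to the real
    // assembly and application jar locations.
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("Simple App")
      .setJars(Seq("target/scala-2.10/file-searcher_2.10-1.0.jar"))
      .set("spark.yarn.jar", "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar")

    val sc = new SparkContext(conf)
    val logData = sc.textFile("src/data/sample.txt", 2).cache()
    val numTHEs = logData.filter(_.contains("the")).count()
    println(s"Lines with the: $numTHEs")
    sc.stop()
  }
}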

Thanks!

Cheloute

Upvotes: 0

Views: 1328

Answers (1)

Cheloute

Reputation: 893

Well, I finally found what my issue was.

  • Firstly, the SBT project must include spark-core and spark-yarn as runtime dependencies (see the build.sbt sketch after this list).
  • Next, the Windows yarn-site.xml must specify, as the YARN classpath, the Cloudera cluster's shared classpath (the classpath valid on the Linux nodes) instead of the Windows classpath. That way the YARN ResourceManager knows where its dependencies are, even when the job is submitted from Windows.
  • Finally, delete the topology.py section from the Windows core-site.xml file so that Spark doesn't try to execute it; it isn't needed for this to work.
  • Don't forget to delete any mapred-site.xml if needed so that YARN/MR2 is used, and to pass all the Spark properties normally defined in spark-defaults.conf on the command line when running the job with spark-submit.
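
A minimal build.sbt sketch for the first point, reusing the versions from the question (they are assumptions here and should match the Spark and Hadoop versions the cluster actually runs):

// Sketch only: versions carried over from the question's build.sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "0.9.1" % "runtime"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "runtime"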

That's it. Everything else should work.

Upvotes: 0
