Cheloute

Reputation: 893

Different behaviour running Scala in Spark on Yarn mode between SBT and spark-submit

I'm currently trying to run some Scala code with Apache Spark in yarn-client mode against a Cloudera cluster, but the sbt run execution is aborted by the following Java exception:

[error] (run-main-0) org.apache.spark.SparkException: YARN mode not available ?
org.apache.spark.SparkException: YARN mode not available ?
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1267)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:199)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
        at SimpleApp$.main(SimpleApp.scala:7)
        at SimpleApp.main(SimpleApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:191)
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1261)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:199)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
        at SimpleApp$.main(SimpleApp.scala:7)
        at SimpleApp.main(SimpleApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

[trace] Stack trace suppressed: run last compile:run for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
        at scala.sys.package$.error(package.scala:27)

[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
15/11/24 17:18:03 INFO network.ConnectionManager: Selector thread was interrupted!
[error] Total time: 38 s, completed 24-nov-2015 17:18:04

I suppose the prebuilt Apache Spark distribution is built with YARN support, because if I run the same program through spark-submit in yarn-client mode there is no Java exception anymore, but YARN does not seem to allocate any resources: I get the same message every second, INFO Client: Application report for application_1448366262851_0022 (state: ACCEPTED). I suspect a configuration issue.

I googled this last message, but I can't figure out which YARN configuration I have to modify (or where) to run my program with Spark on YARN.

Context:

Scala Test Program:

UPDATE

Well, the SBT job failed because hadoop-client.jar and spark-yarn.jar were not on the classpath when the application was packaged and run by SBT.

Now, sbt run asks for the environment variables SPARK_YARN_APP_JAR and SPARK_JAR, with my build.sbt configured like this:

name := "File Searcher"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "0.9.1" % "runtime"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "runtime"
libraryDependencies += "org.apache.hadoop" % "hadoop-yarn-client" % "2.6.0" % "runtime"
resolvers += "Maven Central" at "https://repo1.maven.org/maven2"

Is there any way to configure these variables "automatically"? I mean, I can set SPARK_JAR, since that jar comes with the Spark installation, but what about SPARK_YARN_APP_JAR? Also, when I set those variables manually, I notice that Spark doesn't take my custom configuration into account, even if I set the YARN_CONF_DIR variable. Is there a way to tell SBT to use my local Spark configuration?
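
One way that might work to set these variables from SBT itself (a sketch, assuming sbt 0.13+ with a forked run; the jar paths are the ones used in the code below, and the YARN_CONF_DIR path is purely a hypothetical example):

// build.sbt -- sketch: run in a forked JVM so the environment variables
// Spark's YARN client looks for can be injected by SBT.
fork in run := true

envVars in run := Map(
  "SPARK_JAR"          -> "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar", // Spark assembly jar
  "SPARK_YARN_APP_JAR" -> "target/scala-2.10/file-searcher_2.10-1.0.jar",      // packaged application jar
  "YARN_CONF_DIR"      -> "C:/hadoop/conf"                                      // hypothetical path to the cluster config copies
)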

In case it helps, here is the current (ugly) code I'm executing:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt"
    val sc = new SparkContext("yarn-client", "Simple App", "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar",
      List("target/scala-2.10/file-searcher_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numTHEs = logData.filter(line => line.contains("the")).count()
    println("Lines with the: %s".format(numTHEs))
  }
}
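
For comparison, a minimal sketch of the same program built through a SparkConf instead of the positional SparkContext constructor (assuming a Spark 1.x client; the jar locations are the same ones used above, and spark.yarn.jar is the Spark 1.x property that points YARN at the Spark assembly):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Sketch only: same placeholder paths as above; adjust to the real
    // assembly and application jar locations.
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("Simple App")
      .setJars(Seq("target/scala-2.10/file-searcher_2.10-1.0.jar"))
      .set("spark.yarn.jar", "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar")

    val sc = new SparkContext(conf)
    val logData = sc.textFile("src/data/sample.txt", 2).cache()
    val numTHEs = logData.filter(_.contains("the")).count()
    println(s"Lines with the: $numTHEs")
    sc.stop()
  }
}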

Thanks!

Cheloute

Upvotes: 0

Views: 1328

Answers (1)

Cheloute

Reputation: 893

Well, I finally found what my issue was.

  • Firstly, the SBT project must include spark-core and spark-yarn as runtime dependencies (see the build.sbt sketch after this list).
  • Next, the Windows yarn-site.xml must specify, as the YARN classpath, the Cloudera cluster's shared classpath (the classpath valid on the Linux nodes) instead of the Windows classpath. That way the YARN ResourceManager knows where its dependencies are, even when the job is submitted from Windows.
  • Finally, delete the topology.py section from the Windows core-site.xml file so that Spark doesn't try to execute it; it isn't needed for this to work.
  • Don't forget to delete any mapred-site.xml if needed so that YARN/MR2 is used, and to pass all the Spark properties normally defined in spark-defaults.conf on the command line when running the job with spark-submit.
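
A minimal build.sbt sketch for the first point, reusing the versions from the question (they are assumptions here and should match the Spark and Hadoop versions the cluster actually runs):

// Sketch only: versions carried over from the question's build.sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "0.9.1" % "runtime"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "runtime"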

That's it. Everything else should work.

Upvotes: 0
