XeroxDucati

Reputation: 5200

How do you submit a job to Spark from code?

I've spun up a single-node standalone Spark cluster and confirmed my build works with ./bin/run-example SparkPi 10. Then I wrote a really simple test project in Scala:

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext("spark://UbuntuVM:7077", "Simple Application")

    val count = sc.parallelize(1 to 100).map{i =>
      val x = Math.random()
      val y = Math.random()
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / 100)
  }
}

I'm running this from inside my IDE (IntelliJ). It connects to the cluster successfully, and I see it submit jobs, but they all throw an error:

INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor 192.168.1.233: java.lang.ClassNotFoundException (Main$$anonfun$1) [duplicate 7]

If I understand Spark correctly, this is because the cluster can't find the code. So how do I feed the code to Spark? I'm not running HDFS or anything in this test, but it's all on one box, so I would have expected SparkContext to pass the current directory to Spark, but it apparently does not.
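
My best guess is that I need to package my project into a jar and point the SparkConf at it, something like the sketch below (the jar path is just a guess at whatever sbt package would produce), but I'm not sure that's the intended approach:

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    // Rough sketch -- the jar path is a guess at the `sbt package` output, not verified.
    val sparkConf = new SparkConf()
      .setMaster("spark://UbuntuVM:7077")
      .setAppName("Simple Application")
      .setJars(Seq("target/scala-2.10/simple-application_2.10-1.0.jar"))
    val sc = new SparkContext(sparkConf)
    // ... same Pi computation as above ...
  }
}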

Can anyone point me at the right way to set this up?

Upvotes: 3

Views: 2820

Answers (3)

gszecsenyi

Reputation: 93

You can use the SparkLauncher class for this:

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html

import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
      .setAppResource("/my/app.jar")
      .setMainClass("my.spark.app.Main")
      .setMaster("local")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch();
    spark.waitFor();
  }
}
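
In your case, the same launcher call would point at your standalone master and at a packaged jar of your project. Here is a Scala sketch; the jar path and main class are placeholders, and you need to build the jar first (e.g. with sbt package):

import org.apache.spark.launcher.SparkLauncher

object MyScalaLauncher {
  def main(args: Array[String]): Unit = {
    // Placeholder jar path and main class -- adjust to your own project and build output.
    val spark = new SparkLauncher()
      .setAppResource("/path/to/your-app.jar")
      .setMainClass("Main")
      .setMaster("spark://UbuntuVM:7077")
      .launch()
    spark.waitFor()
  }
}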

Upvotes: 1

WestCoastProjects

Reputation: 63221

You are missing a crucial step:

org.apache.spark.deploy.SparkSubmit

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

which is what actually submits a job to the cluster. Unfortunately, there is presently no solid working wrapper for it other than spark-submit itself, so there is no reliable way to submit Spark jobs programmatically. There is a JIRA for this that was partially addressed in Feb 2015, but it lacks documentation:

 https://github.com/apache/spark/pull/3916/files

The difficulty lies in the complexity of the environment setup performed by spark-submit; so far it has not proven possible to replicate it purely within Scala/Java code.
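
If you need to drive a submission from code today, the pragmatic workaround is to shell out to the spark-submit script itself. A rough Scala sketch (the Spark installation path and jar path are placeholders):

import scala.sys.process._

object SubmitViaScript {
  def main(args: Array[String]): Unit = {
    // Placeholder paths -- point these at your Spark installation and packaged application jar.
    val exitCode = Seq(
      "/path/to/spark/bin/spark-submit",
      "--master", "spark://UbuntuVM:7077",
      "--class", "Main",
      "/path/to/your-app.jar"
    ).!
    println("spark-submit exited with code " + exitCode)
  }
}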

Upvotes: 1

pishen

Reputation: 419

If you want to test your Spark program locally, you don't even need to spin up the single-node standalone Spark cluster. Just set your master URL to local[*] like this:

val sc = new SparkContext("local[*]", "Simple Application", sparkConf)

Then type run in sbt to run your program (this should be the same as running from IntelliJ, though I'm used to running programs from the terminal with sbt).

Since you may not want to keep switching the master URL in your code between local[*] and spark://..., you can leave it out of the code entirely:

val sc = new SparkContext(new SparkConf())

and set the corresponding Java system properties at run time. For example, in build.sbt you can add:

javaOptions := Seq("-Dspark.master=local[*]", "-Dspark.app.name=my-app")

and run it using run in sbt.
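
This works because a blank SparkConf loads any spark.* Java system properties by default, so the -D flags above are picked up automatically on the driver side. A small sketch to make that explicit (the println is just for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    // new SparkConf() reads spark.master and spark.app.name from the -D flags set above.
    val sc = new SparkContext(new SparkConf())
    println("Running against master: " + sc.master)
    sc.stop()
  }
}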


To get a more complete local-mode experience, you may want to add the following lines to your build.sbt:

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
runMain in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
fork := true
javaOptions := Seq("-Dspark.master=local[*]", "-Dspark.app.name=my-app")
outputStrategy := Some(StdoutOutput)

We have created an sbt plugin which can add these settings for you. It can also help you deploy a standalone Spark cluster on a cloud system like AWS EC2. Take a look at spark-deployer if you are interested.

Upvotes: 1
