Reputation: 5200
I've spun up a single-node standalone Spark cluster and confirmed my build works with ./bin/run-example SparkPi 10. Then I wrote a really simple test project in Scala:
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext("spark://UbuntuVM:7077", "Simple Application")
    val count = sc.parallelize(1 to 100).map { i =>
      val x = Math.random()
      val y = Math.random()
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / 100)
  }
}
I'm running this from inside my IDE (IntelliJ). It connects to the cluster successfully, and I see it submit jobs, but they all fail with an error:
INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor 192.168.1.233: java.lang.ClassNotFoundException (Main$$anonfun$1) [duplicate 7]
If I understand Spark correctly, this is because the cluster can't find the code. So how do I feed the code to Spark? I'm not running HDFS or anything in this test, but it's all on one box, so I would have expected SparkContext to pass the current directory to Spark; apparently it does not.
Can anyone point me at the right way to set this up?
Upvotes: 3
Views: 2820
Reputation: 93
You can submit your application programmatically using the SparkLauncher class:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
      .setAppResource("/my/app.jar")
      .setMainClass("my.spark.app.Main")
      .setMaster("local")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch();
    spark.waitFor();
  }
}
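In your case the same approach could point at the standalone master instead of local mode, using the jar produced by sbt package. A minimal Scala sketch along those lines, where the jar path and main class name are placeholders for whatever your build actually produces:
import org.apache.spark.launcher.SparkLauncher

object ClusterLauncher {
  def main(args: Array[String]): Unit = {
    // Requires SPARK_HOME to be set (or call setSparkHome) so the launcher can find spark-submit.
    // The jar path and class name below are placeholders for your build's actual output.
    val spark = new SparkLauncher()
      .setAppResource("target/scala-2.10/simple-app_2.10-1.0.jar")
      .setMainClass("Main")
      .setMaster("spark://UbuntuVM:7077")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()
    spark.waitFor()
  }
}
Because spark-submit ships the application jar to the executors, they can then load the Main$$anonfun$1 class that was missing in your run.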
Upvotes: 1
Reputation: 63221
You are missing a crucial step:
org.apache.spark.deploy.SparkSubmit
which is what actually submits a job to the cluster. Unfortunately, there is currently no solid working wrapper for it other than spark-submit itself, so there is no reliable way to submit Spark jobs programmatically. There is a JIRA for this that was partially addressed in February 2015, but it lacks documentation:
https://github.com/apache/spark/pull/3916/files
The difficulty lies in the complexity of the environment setup that spark-submit performs; it has not proven possible to replicate it purely from Scala/Java code.
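In the meantime, the pragmatic workaround is to package your jar and shell out to spark-submit from code. A rough sketch, assuming spark-submit is on the PATH and that the jar path matches what sbt package produces for your project (both are assumptions):
import scala.sys.process._

object SubmitViaShell {
  def main(args: Array[String]): Unit = {
    // Assumes spark-submit is on the PATH and the jar path matches your build output.
    val exitCode = Seq(
      "spark-submit",
      "--class", "Main",
      "--master", "spark://UbuntuVM:7077",
      "target/scala-2.10/simple-app_2.10-1.0.jar"
    ).!
    println(s"spark-submit exited with code $exitCode")
  }
}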
Upvotes: 1
Reputation: 419
If you just want to test your Spark program locally, you don't even need to spin up the single-node standalone cluster. Just set your master URL to local[*], like this:
val sc = new SparkContext("local[*]", "Simple Application", sparkConf)
Then type run in the sbt console to run your program (this should behave the same as running from IntelliJ; I usually run programs from the terminal with sbt).
Since you probably don't want to keep switching the master URL in your code between local[*] and spark://..., you can leave it out of the code entirely:
val sc = new SparkContext(new SparkConf())
and set it through Java system properties at run time. For example, in build.sbt you can add:
javaOptions := Seq("-Dspark.master=local[*]", "-Dspark.app.name=my-app")
and run it using run in sbt.
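For reference, a minimal build.sbt with these pieces might look like the following (the project name, Scala version, and Spark version are assumptions; adjust them to your setup):
// Minimal sketch; names and versions are assumptions.
name := "simple-app"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

// javaOptions only take effect in a forked JVM.
fork := true

javaOptions := Seq("-Dspark.master=local[*]", "-Dspark.app.name=my-app")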
To get a more complete local-mode experience, you may want to add the following lines to your build.sbt:
// Make run and runMain use the full compile classpath (this also picks up dependencies marked provided).
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

runMain in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

// Fork a separate JVM so the javaOptions below are actually applied.
fork := true

javaOptions := Seq("-Dspark.master=local[*]", "-Dspark.app.name=my-app")

// Forward the forked process's output to sbt's stdout.
outputStrategy := Some(StdoutOutput)
We have created an sbt plugin that can add these settings for you. It can also help you deploy a standalone Spark cluster on a cloud system like AWS EC2; have a look at spark-deployer if you are interested.
Upvotes: 1