blue-sky

Reputation: 53826

How to build and run Scala Spark locally

I'm attempting to build Apache Spark locally. The reason is to debug Spark methods such as reduce. In particular, I'm interested in how Spark implements and distributes MapReduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best way to find out what the issue is.

So I have cloned the latest from the Spark repo:

git clone https://github.com/apache/spark.git

Spark appears to be a Maven project, so when I import it into Eclipse, here is the structure:

[screenshot: Eclipse project structure of the Spark repo]

Some of the top-level folders also have pom files:

[screenshot: top-level folders containing pom.xml files]

So should I just be building one of these sub-projects? Are these the correct steps for running Spark against a local code base?

Upvotes: 3

Views: 15713

Answers (1)

maasg

Reputation: 37435

Building Spark locally, the short answer:

git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile

Going into your question in more detail, what you're actually asking is "How do I debug a Spark application in Eclipse?". To debug in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create a job with a dependency on the Spark library and ask Maven to 'download sources'. That way you can use the Eclipse debugger to step into the Spark code.
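For example, if you manage the job with sbt rather than Maven, a minimal build definition might look like the sketch below (the project name and version numbers are assumptions; pick the Spark version matching your cluster, and run sbt updateClassifiers to pull down the source jars):

```scala
// build.sbt -- minimal sketch; project name and versions are assumptions
name := "spark-debug-example"

scalaVersion := "2.10.4"

// spark-core is the module containing RDD, reduce, the scheduler, etc.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
```

With the sources attached, setting a breakpoint inside RDD.reduce works the same as in your own code.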

Then, when creating the Spark context, set the master to local[1] in the SparkConf, like:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("SparkDebugExample")
val sc = new SparkContext(conf)

so that all Spark interactions are executed in local mode, in a single thread, and are therefore visible to your debugger.
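As for what you will be stepping through: conceptually, RDD.reduce applies your function within each partition first, then merges the per-partition results on the driver. A plain-Scala sketch of that two-level scheme (no Spark required; the function and names here are illustrative, not Spark's actual implementation):

```scala
// Illustrative sketch of the two-level reduce Spark performs:
// the function f is applied inside each partition, and the partial
// results are then merged. Names are hypothetical, not Spark API.
def distributedReduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): T =
  partitions
    .filter(_.nonEmpty)
    .map(_.reduce(f)) // per-partition reduce (done by each executor)
    .reduce(f)        // merge partial results (done on the driver)

val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
val total = distributedReduce(parts)(_ + _) // 21, same as (1 to 6).sum
```

This also shows why the function passed to reduce must be associative and commutative: partition boundaries and merge order are not deterministic, so a non-commutative function can silently give different results per run.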

If you are investigating a performance issue, remember that Spark is a distributed system in which the network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job on the actual cluster will be required to get a complete picture of the performance characteristics of your job.

Upvotes: 12
