Pengin

Reputation: 4772

Spark workflow with jar

I'm trying to understand the extent to which one must compile a jar to use Spark.

I'd normally write ad-hoc analysis code in an IDE and run it locally against the data with a single click (in the IDE). If my experiments with Spark are giving me the right indication, then I have to compile my script into a jar and send it to all the Spark nodes. I.e. my workflow would be:

  1. Write the analysis script, which will upload a jar of itself (created in step 2).
  2. Build the jar.
  3. Run the script.

For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
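
For concreteness, the "build the jar" step is typically just sbt package against a minimal build file along these lines (the project name and version numbers here are placeholders; spark-core is marked "provided" because the cluster already supplies it):

name := "analysis"

version := "0.1"

scalaVersion := "2.11.12"

// "provided": the cluster already ships Spark, so the jar only needs to carry my own code
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"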

Update:

Here's an example, which I couldn't get to work unless I compiled it into a jar and called sc.addJar. The fact that I must do this seems odd, since it is only plain Scala and Spark code.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD

object Runner {
  def main(args: Array[String]) {
    val logFile = "myData.txt" 
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")

    val sc = new SparkContext(conf)

    sc.addJar("Analysis.jar")
    sc.addFile(logFile)

    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()

    Analysis.run(logData)
  }
}

object Analysis {
  def run(logData: RDD[String]) {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}

Upvotes: 3

Views: 262

Answers (3)

Mauricio Bustos

Reputation: 124

Your use of 'filter' creates an anonymous function:

scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>

The class generated for that function is not available to the workers unless the jar is distributed to them. Did the stack trace on the worker point to a missing class?

If you just want to debug locally without having to distribute the jar, you can use the 'local' master:

val conf = new SparkConf().setAppName("myApp").setMaster("local")
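
For example, a local-mode variant of the Runner from the question (a sketch; it assumes myData.txt is readable from the driver's own filesystem) needs neither addJar nor addFile, because everything runs inside a single JVM:

import org.apache.spark.{SparkConf, SparkContext}

object LocalRunner {
  def main(args: Array[String]) {
    // "local[*]" runs Spark inside this JVM using all available cores,
    // so the Analysis class is already on the classpath and nothing is shipped
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("local[*]")

    val sc = new SparkContext(conf)

    val logData = sc.textFile("myData.txt", 2).cache()
    Analysis.run(logData)

    sc.stop()
  }
}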

Upvotes: 1

Priyank Desai

Reputation: 3835

You can use SparkContext.jarOfObject(Analysis) (or SparkContext.jarOfClass(Analysis.getClass)) to locate the jar your code was loaded from, so you don't have to hard-code its path when distributing it.

Find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.

def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]

You want to do something like:

sc.addJar(SparkContext.jarOfObject(Analysis).get)
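
Since jarOfObject returns an Option[String] (it is None when the class was not loaded from a jar, e.g. when running straight from an IDE or the REPL), a slightly safer sketch is:

// Only add the jar if the Analysis object really was loaded from one;
// otherwise this is a no-op instead of a NoSuchElementException from .get
SparkContext.jarOfObject(Analysis).foreach(jar => sc.addJar(jar))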

HTH!

Upvotes: 1

Holden

Reputation: 7452

While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark also ships with shells in Scala, Python and R. The current quick start guide (https://spark.apache.org/docs/latest/quick-start.html) only mentions the Scala and Python shells, but the SparkR guide (https://spark.apache.org/docs/latest/sparkr.html) describes how to work with SparkR interactively as well. Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
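
For instance, the filter/count logic from the question can be typed straight into the Scala shell (bin/spark-shell, optionally with --master spark://Spark-Master:7077); the shell creates a SparkContext for you as sc and ships the compiled closures to the workers itself, so there is no jar step. A rough sketch, assuming myData.txt is reachable by the workers (e.g. on a shared path or HDFS):

scala> val logData = sc.textFile("myData.txt", 2).cache()
scala> val numA = logData.filter(line => line.contains("a")).count()
scala> val numB = logData.filter(line => line.contains("b")).count()
scala> println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))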

Upvotes: 1
