Reputation: 4772
I'm trying to understand the extent to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are anything to go by, I have to compile my script into a jar and send it to all the Spark nodes. I.e. my workflow would be
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and called sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD

object Runner {
  def main(args: Array[String]): Unit = {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")
    val sc = new SparkContext(conf)
    sc.addJar("Analysis.jar")
    sc.addFile(logFile)
    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()
    Analysis.run(logData)
  }
}

object Analysis {
  def run(logData: RDD[String]): Unit = {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}
Upvotes: 3
Views: 262
Reputation: 124
You are creating an anonymous function in the call to 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
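The compiler turns that function into a generated class of its own, and the executors cannot load that class unless the jar containing it is distributed to the workers. Did the stack trace on the worker highlight a missing symbol, such as a ClassNotFoundException?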
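If you want to see the generated class for yourself, here is a minimal sketch in plain Scala (no Spark needed). The object name ClosureDemo is made up for illustration, and the exact class name printed depends on your Scala and JVM versions, so treat the output as something to inspect rather than a fixed value.

```scala
object ClosureDemo {
  // The same predicate that the question passes to `filter`.
  val containsA: String => Boolean = line => line.contains("a")

  // Name of the JVM class that backs the function value.
  // Executors need the code for this function on their classpath to run it.
  def closureClassName: String = containsA.getClass.getName

  def main(args: Array[String]): Unit = {
    println(containsA("spark"))
    println(closureClassName)
  }
}
```

The printed name will not be one you wrote yourself, which is why nothing on the worker can resolve it until the jar arrives.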
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
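As a rough sketch of what that looks like end to end, the following runs the same kind of counting entirely inside one JVM, so no jar has to be shipped anywhere. The names LocalRunner and countContaining and the in-memory sample data are assumptions made for illustration. I use "local[*]" here, which asks for one worker thread per core; plain "local" uses a single thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LocalRunner {
  // Counts the lines that contain `needle`, mirroring Analysis.run in the question.
  def countContaining(lines: RDD[String], needle: String): Long =
    lines.filter(_.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("myApp")
      .setMaster("local[*]") // driver and executors share this JVM
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for myData.txt.
      val lines = sc.parallelize(Seq("alpha", "beta", "gamma", "delta"))
      println(s"Lines with 'a': ${countContaining(lines, "a")}")
      println(s"Lines with 'b': ${countContaining(lines, "b")}")
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }
}
```

This can be launched from the IDE like any other main method, provided the Spark libraries are on the project's classpath.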
Upvotes: 1
Reputation: 3835
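You can use SparkContext.jarOfObject(Analysis) to look up the jar that Analysis was loaded from, so you don't have to hard-code its path. Pass the object itself rather than Analysis.getClass: jarOfObject calls getClass on its argument, so passing a Class would look up the jar of java.lang.Class instead. From the API docs:
Find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
You want to do something like:
sc.addJar(SparkContext.jarOfObject(Analysis).get)
Note that both methods return None when the class was not loaded from a jar (for example, when running from an IDE's class directory), in which case .get will throw.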
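Because the lookup returns an Option, a slightly more defensive variant may be worth having. The sketch below is only one way to do it; JarShipping and addJarOf are hypothetical names, not part of Spark, and the only Spark calls used are SparkContext.jarOfObject and sc.addJar.

```scala
import org.apache.spark.SparkContext

object JarShipping {
  // Hypothetical helper: registers the jar that contains `obj`'s class, if any.
  // Returns true when a jar was found and passed to sc.addJar, false otherwise.
  def addJarOf(sc: SparkContext, obj: AnyRef): Boolean =
    SparkContext.jarOfObject(obj) match {
      case Some(jar) =>
        sc.addJar(jar)
        true
      case None =>
        // The class was not loaded from a jar (e.g. an IDE's class directory).
        false
    }
}
```

With the question's code you would call JarShipping.addJarOf(sc, Analysis) and fall back to packaging a jar, or to a local master, when it returns false.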
HTH!
Upvotes: 1
Reputation: 7452
While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
Upvotes: 1