Reputation: 5618
I am relatively new to Spark. I have created scripts that run as offline, batch-processing tasks, but now I am attempting to use Spark from within a live application (i.e. not using spark-submit).
My application looks like the following:
The fact that the job is initiated by an HTTP request, and that the Spark job is dependent on data in this request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:
import org.apache.spark.{SparkConf, SparkContext}
import com.typesafe.config.ConfigFactory

object MyApplication {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()
    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))
    val sc = new SparkContext(sparkConf)
    val localData = Seq.fill(10000)(1)
    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))
    println(s"My Result ${data.count()}")
    sc.stop()
  }
}

class Test(num: Int) extends Serializable
This is then run with:
sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test
...
I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I could ship using .setJars("my-library-of-classes") (a rough sketch of what that might look like follows), but that would be very awkward to work with and deploy. I've also considered not using any classes within the Spark logic and only using tuples, but that would be just as awkward to use.
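For reference, a setJars-based setup might look roughly like the following. The jar path and name here are hypothetical placeholders for a jar built separately (e.g. with sbt package or sbt-assembly) that contains Test and the other application classes:

import org.apache.spark.{SparkConf, SparkContext}
import com.typesafe.config.ConfigFactory

// Sketch only: ship a pre-built jar of application classes to the executors.
val config = ConfigFactory.load()
val sparkConf = new SparkConf(true)
  .setMaster(config.getString("spark.master-uri"))
  .setAppName(config.getString("spark.app-name"))
  .setJars(Seq("/path/to/my-library-of-classes.jar")) // hypothetical path to the class-library jar
val sc = new SparkContext(sparkConf)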
Is there a way for me to do this? Have I missed something obvious, or are we really forced to use spark-submit to communicate with Spark?
Upvotes: 1
Views: 172
Reputation: 11
Currently spark-submit is the only way to submit Spark jobs to a cluster, so you have to create a fat jar consisting of all your classes and dependencies and use it with spark-submit. You can use command-line arguments to pass either the data itself or paths to configuration files.
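A rough sketch of that approach, assuming a fat jar built with something like sbt-assembly (the jar name, config layout, and argument order below are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import com.typesafe.config.ConfigFactory

// Packaged into a fat jar and launched with something like:
//   spark-submit --class my.package.MyApplication --master spark://host:7077 \
//     my-application-assembly.jar /path/to/application.conf
object MyApplication {
  def main(args: Array[String]): Unit = {
    // First command-line argument: path to a configuration file (hypothetical convention).
    val config = ConfigFactory.parseFile(new java.io.File(args(0)))
    val sparkConf = new SparkConf(true).setAppName(config.getString("spark.app-name"))
    val sc = new SparkContext(sparkConf)
    // ... job logic using classes bundled in the fat jar ...
    sc.stop()
  }
}

Because the application classes travel inside the jar that spark-submit distributes, the executors can resolve my.package.Test without any extra setJars call.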
Upvotes: 1