Jake Greene

Reputation: 5618

Using Spark From Within My Application

I am relatively new to Spark. I have written scripts that run as offline batch-processing tasks, but now I am attempting to use Spark from within a live application (i.e. not using spark-submit).

My application looks like the following:

  1. An HTTP request is received to initiate a data-processing task. The payload of the request contains the data needed for the Spark job (time window to look at, data types to ignore, weights to use for ranking algorithms, etc.).
  2. Processing begins and eventually produces a result. The result is expected to fit into memory.
  3. The result is sent to an endpoint (written to a single file, sent to a client through a WebSocket or SSE, etc.). A rough code sketch of this flow is below.
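In code terms, I picture something roughly like this (a minimal, framework-agnostic sketch; JobParameters, handleRequest and publishResult are placeholder names, not part of my real code):

import org.apache.spark.SparkContext

// Placeholder for the payload parsed from the HTTP request
case class JobParameters(windowStart: Long, windowEnd: Long, ignoredTypes: Set[String])

object RequestDrivenJobs {
  // The SparkContext is created once at application start-up and reused for each request
  def handleRequest(sc: SparkContext, params: JobParameters): Long = {
    // Illustrative processing only; the real job would use the window, ignored types, weights, etc.
    val result = sc.parallelize(1 to 10000).count()
    publishResult(result) // write to a file, push over WebSocket / SSE, ...
    result
  }

  def publishResult(result: Long): Unit = println(s"Result: $result")
}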

The fact that the job is initiated by an HTTP request, and that the Spark job depends on data in this request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:

import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

object MyApplication {
  def main(args: Array[String]): Unit = {

    val config = ConfigFactory.load()

    // Build the Spark configuration from the application's Typesafe config
    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))

    val sc = new SparkContext(sparkConf)
    val localData = Seq.fill(10000)(1)
    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))
    println(s"My Result ${data.count()}")
    sc.stop()
  }
}

class Test(num: Int) extends Serializable

Which is then run with:

sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test 
...

I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I can send using .setJars("my-library-of-classes"), but that would be very awkward to work with and deploy. I've also considered not using any of my own classes within the Spark logic and only using tuples, but that would be just as awkward to use.
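For reference, the setJars approach I have in mind would look roughly like this (the jar path is just a placeholder for whatever sbt package would produce for that extracted library):

import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

object MyApplicationWithJars {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()

    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))
      // Ship the jar containing Test (and the rest of the extracted classes) to the executors.
      // The path is a placeholder for the artifact built from the stand-alone library.
      .setJars(Seq("target/scala-2.10/my-library-of-classes.jar"))

    val sc = new SparkContext(sparkConf)
    val data = sc.parallelize(Seq.fill(10000)(1)).map(i => new Test(i))
    println(s"My Result ${data.count()}")
    sc.stop()
  }
}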

Is there a way for me to do this? Have I missed something obvious or are we really forced to use spark-submit to communicate with Spark?

Upvotes: 1

Views: 172

Answers (1)

abon

Reputation: 11

Currently, spark-submit is the only way to submit Spark jobs to a cluster, so you have to create a fat jar containing all of your classes and dependencies and launch it with spark-submit. You can use command-line arguments to pass either the data itself or paths to configuration files.
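A minimal sketch of what that could look like (the class name, jar name and argument format are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object BatchJob {
  // Launched with something like:
  //   spark-submit --class my.package.BatchJob my-app-assembly.jar <windowStart> <windowEnd>
  def main(args: Array[String]): Unit = {
    val Array(windowStart, windowEnd) = args.take(2) // data passed in on the command line

    val sc = new SparkContext(new SparkConf(true).setAppName("batch-job"))
    val count = sc.parallelize(1 to 10000).count()   // the real processing would go here
    println(s"Processed window [$windowStart, $windowEnd] -> $count")
    sc.stop()
  }
}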

Upvotes: 1
