Sergey Yakovlev

Reputation: 177

How to submit Spark jobs generated at runtime?

So for my use case I need to create and submit Spark streaming jobs at runtime. Having done some googling, I don't think that there's a simple way of executing Spark jobs without creating the jar file first...

My idea is to make a builder-like abstraction over Spark/Scala code, configure it at runtime by just injecting the relevant objects, and then convert that abstraction to actual raw Scala code and write it to disk.

Then I would use ProcessBuilder or something similar to run sbt package on that Scala code and build the jar that way. Then I should be able to submit the job programmatically using either SparkLauncher or, again, ProcessBuilder by just running the spark-submit command.
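
A rough sketch of that flow (the project layout, jar name and main class below are made up, just to illustrate the idea):

    import java.nio.file.{Files, Paths}

    object RuntimeJobBuilder {
      // Assumes the builder abstraction has already produced raw Scala source.
      def buildAndSubmit(generatedSource: String, projectDir: String): Unit = {
        // 1. Write the generated code into an sbt project layout on disk.
        val srcPath = Paths.get(projectDir, "src", "main", "scala", "GeneratedJob.scala")
        Files.createDirectories(srcPath.getParent)
        Files.write(srcPath, generatedSource.getBytes("UTF-8"))

        // 2. Build the jar with sbt package.
        val sbt = new ProcessBuilder("sbt", "package")
          .directory(new java.io.File(projectDir))
          .inheritIO()
          .start()
        require(sbt.waitFor() == 0, "sbt package failed")

        // 3. Submit the freshly built jar with spark-submit.
        new ProcessBuilder(
          "spark-submit",
          "--class", "GeneratedJob",
          "--master", "yarn",
          s"$projectDir/target/scala-2.11/generated-job_2.11-0.1.jar")
          .inheritIO()
          .start()
          .waitFor()
      }
    }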

This all seems a little idiotic, if I'm honest. Does anyone have any better ideas of submitting jobs programmatically?


The downside of using SparkLauncher is that I'll have to prepackage a huge Spark job jar with all the functionality it could possibly need. I could then submit it using SparkLauncher and supply it with the relevant -D arguments to select the particular functionality at runtime.
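
For example, something like this (jar path, class name and the -D property are made up):

    import org.apache.spark.launcher.SparkLauncher

    // Submit the prepackaged "do-everything" jar and pick the behaviour at runtime.
    val handle = new SparkLauncher()
      .setAppResource("/opt/jobs/all-in-one-streaming-jobs.jar")
      .setMainClass("com.example.GenericStreamingJob")
      .setMaster("yarn")
      .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Djob.mode=clickstream")
      .addAppArgs("--topic", "events")
      .startApplication()

    // handle.getState can be polled to track the submitted application.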

Upvotes: 3

Views: 890

Answers (2)

Jacek Laskowski

Reputation: 74639

I had a client with this requirement once, and what worked fine was to create a generic Spark application that accepts parameters specifying the lower-level configuration details, like the ML algorithm to use. With that generic Spark application you'd use SparkLauncher to submit it for execution (specifying the master URL and other deployment-specific parameters).

Actually, if you're after Spark MLlib and the different ML algorithms that Spark supports, things are fairly easy to abstract away in the generic Spark application, as you could write an ML pipeline that does the pre-processing and selects the estimator (the algorithm) by name, possibly by class name.

You could also split the pre-processing part (Spark SQL / ML Transformers) and the main ML pipeline into two separate classes that the main generic Spark application would use.
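
Here's a minimal sketch of such a generic application (class, argument and path names are made up):

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object GenericMLJob {
      def main(args: Array[String]): Unit = {
        // e.g. args = Array("org.apache.spark.ml.classification.LogisticRegression", "/data/input")
        val Array(estimatorClassName, inputPath) = args

        val spark = SparkSession.builder().appName("generic-ml-job").getOrCreate()
        val data = spark.read.parquet(inputPath)

        // The pre-processing part could live in its own class as described above.
        val assembler = new VectorAssembler()
          .setInputCols(data.columns.filter(_ != "label"))
          .setOutputCol("features")

        // Select the estimator (the algorithm) by its class name.
        val estimator = Class.forName(estimatorClassName)
          .newInstance()
          .asInstanceOf[PipelineStage]

        val model = new Pipeline()
          .setStages(Array(assembler, estimator))
          .fit(data)
        model.write.overwrite().save(s"$inputPath-model")

        spark.stop()
      }
    }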

See Spark MLlib's ML Pipelines in the official documentation.


Since you're worried about...

The downside of using SparkLauncher is that I'll have to prepackage a huge Spark job jar with all the functionality it could possibly need.

I doubt that it's an issue. Without looking at the requirements first it's hard to say how huge your Spark application jar is going to be, but again, if it's about Spark MLlib I'm sure the ML Pipeline feature would cut the number of lines to a minimum.

Janino

You could also consider generating the code on the fly as Spark SQL does in WholeStageCodegenExec and other physical operators.

Spark SQL uses the Janino compiler for code generation, so reviewing that part of Spark would show you another (very low-level) way of doing code compilation at runtime, one that would give you the most flexibility.
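
A bare-bones sketch with plain Janino (nothing Spark-specific here, just to show what compiling code handed over as a String looks like):

    import org.codehaus.janino.ClassBodyEvaluator

    // Compile a class body supplied as a String and use it immediately.
    val evaluator = new ClassBodyEvaluator()
    evaluator.setImplementedInterfaces(Array[Class[_]](classOf[java.util.function.IntUnaryOperator]))
    evaluator.cook(
      """public int applyAsInt(int x) {
        |  return x * x;  // pretend this body was generated at runtime
        |}""".stripMargin)

    val square = evaluator.getClazz.newInstance().asInstanceOf[java.util.function.IntUnaryOperator]
    assert(square.applyAsInt(5) == 25)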

The downside is that reviewing or testing the code that generates the final code might be a lot of work, and very few people would be able to help you with that.


Speaking of this impure and very imperative Janino compiler world got me thinking about higher-order abstractions like tagless final or similar. But I digress.

Upvotes: 2

Raphael Roth

Reputation: 27373

What about using e.g. ProcessBuilder to run something like this:

echo 'println("hello World")' | spark-shell

I don't see the need to create jars first.
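
For instance, a quick sketch (assuming spark-shell is on the PATH):

    // Pipe code generated at runtime straight into spark-shell.
    val code = """println("hello World")"""
    new ProcessBuilder("bash", "-c", s"echo '$code' | spark-shell")
      .inheritIO()
      .start()
      .waitFor()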

Upvotes: 1
