Reputation: 954
I have a report stats project that uses Spark 2.1 (Scala). Here is how it works:
object PtStatsDayApp extends App {
Stats A...
Stats B...
Stats C...
.....
}
Someone put many stat computations (mostly unrelated to each other) in one class and submits it with a shell script. I see two problems with this:
if one stat gets stuck, the stats after it cannot run
if one stat fails, the whole application has to rerun from the beginning
I have two refactoring solutions:
Any other ideas or best practices? Thanks.
Upvotes: 0
Views: 653
Reputation: 3470
First, you can set the scheduler mode to FAIR. Then you can use parallel collections to launch simultaneous Spark jobs from a multithreaded driver.
A parallel collection, let's say a parallel sequence (ParSeq) of ten of your Stats queries, can use a foreach to fire off each of the Stats queries concurrently. How many run at the same time depends on how many cores the driver has; by default, the global execution context has that many threads.
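A minimal sketch of the idea, assuming hypothetical runStatA/runStatB/runStatC methods that each wrap one of your Stats computations (the names and configuration values are illustrative, not from your project):

import org.apache.spark.sql.SparkSession

object PtStatsDayApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PtStatsDay")
      .config("spark.scheduler.mode", "FAIR") // let concurrent jobs share executors fairly
      .getOrCreate()

    // Hypothetical, mostly unrelated stat computations; each triggers its own Spark job(s)
    val stats: Seq[(String, () => Unit)] = Seq(
      ("statA", () => runStatA(spark)),
      ("statB", () => runStatB(spark)),
      ("statC", () => runStatC(spark))
    )

    // .par turns the Seq into a ParSeq; foreach then fires each stat from its own
    // driver thread, so a slow or stuck stat no longer blocks the others
    stats.par.foreach { case (name, run) =>
      try run()
      catch {
        case e: Exception =>
          // a failed stat is reported but does not abort the remaining ones
          println(s"$name failed: ${e.getMessage}")
      }
    }

    spark.stop()
  }

  // placeholders for the real computations
  def runStatA(spark: SparkSession): Unit = ???
  def runStatB(spark: SparkSession): Unit = ???
  def runStatC(spark: SparkSession): Unit = ???
}

The number of stats actually running at once is bounded by the default global execution context, i.e. roughly the number of driver cores.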
Check out these posts; they are examples of launching concurrent Spark jobs with parallel collections.
Cache and Query a Dataset In Parallel Using Spark
Launching Apache Spark SQL jobs from multi-threaded driver
Upvotes: 1
Reputation: 1339
There are several free third-party Spark schedulers like Airflow, but I suggest using the Spark Launcher API and writing the launching logic programmatically. With this API you can run your jobs in parallel, sequentially, or however you want.
Link to doc: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/launcher/package-summary.html
How efficiently your jobs run in parallel mostly depends on your Spark cluster configuration. In general, Spark supports this kind of workload.
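A rough sketch with the Launcher API, assuming each stat has been split into its own main class inside one jar (the class names, jar path, and master are placeholders, not taken from your question):

import org.apache.spark.launcher.SparkLauncher

object StatsLauncher {
  def main(args: Array[String]): Unit = {
    // Hypothetical main classes, one per stat, so each can be launched and retried independently
    val statClasses = Seq("stats.StatA", "stats.StatB", "stats.StatC")

    // startApplication() is non-blocking, so all stats are submitted in parallel
    val handles = statClasses.map { mainClass =>
      new SparkLauncher()
        .setAppResource("/path/to/pt-stats.jar") // placeholder jar path
        .setMainClass(mainClass)
        .setMaster("yarn")
        .setConf("spark.executor.memory", "4g")
        .startApplication()
    }

    // Wait until every submitted application reaches a terminal state
    while (!handles.forall(_.getState.isFinal)) {
      Thread.sleep(5000)
    }

    handles.zip(statClasses).foreach { case (handle, name) =>
      println(s"$name finished with state ${handle.getState}")
    }
  }
}

Running the stats sequentially instead is just a matter of waiting for each handle to finish before launching the next one.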
Upvotes: 1