Reputation: 257
Each Spark job runs one Oracle query, so I have to run multiple jobs in parallel so that all the queries fire at the same time.
How do I run multiple jobs in parallel?
Upvotes: 24
Views: 53762
Reputation: 1026
In Scala, you can run multiple jobs in parallel using parallel collections. You can parallelize a list using the method .par:

val myParallelList = List(1, 2, 3, 4).par

When you use a map or a foreach on a parallelized list, the operations run in parallel.
So you could do something like the following:
List(df1, df2, df3, df4, df5)
  .zipWithIndex
  .par
  .foreach { case (df, index) =>
    // each write is its own Spark action, so the jobs are submitted concurrently
    df.write.parquet(s"/tmp/dataframe$index")
  }
This would result in the 5 dataframes being written in parallel. If you look at the Spark UI, you should see the jobs running at the same time. You can adapt this snippet to run any group of jobs in parallel.
Upvotes: 0
Reputation: 74759
Quoting the official documentation on Job Scheduling:
Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads.
In other words, a single SparkContext instance can be used by multiple threads, which gives you the ability to submit multiple Spark jobs that may or may not run in parallel.
Whether the Spark jobs actually run in parallel depends on the number of CPUs (Spark does not track memory usage for scheduling). If there are enough CPUs to handle the tasks from multiple Spark jobs, they will run concurrently.
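For example, here is a minimal sketch of submitting two actions from separate threads against the same SparkContext, using plain Scala Futures and two hypothetical DataFrames df1 and df2 (the paths are made up too):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// df1 and df2 are placeholder DataFrames; each write is a separate Spark action,
// submitted from its own thread via the global execution context
val jobs = Seq(
  Future { df1.write.parquet("/tmp/out1") },
  Future { df2.write.parquet("/tmp/out2") }
)

// block until both jobs have finished
Await.result(Future.sequence(jobs), Duration.Inf)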
If, however, the number of CPUs is not enough, you may consider using FAIR scheduling mode (FIFO is the default):
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
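As a sketch, FAIR mode can be enabled through the spark.scheduler.mode configuration property when the application is created (the application name and pool name below are just examples):

import org.apache.spark.sql.SparkSession

// enable the FAIR scheduler instead of the default FIFO one
val spark = SparkSession.builder()
  .appName("parallel-jobs")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// optionally, assign jobs submitted from this thread to a named scheduler pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reports")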
Just to clear things up a bit: spark-submit submits a Spark application for execution (not individual Spark jobs). A single Spark application can have one or more Spark jobs.
RDD actions may or may not be blocking. SparkContext comes with two methods to submit (or run) a Spark job, SparkContext.runJob and SparkContext.submitJob, so it does not really matter whether an action is blocking or not; what matters is which SparkContext method you use to get non-blocking behaviour.
Please note that the "RDD action methods" are already written and their implementations use whatever the Spark developers bet on (mostly SparkContext.runJob, as in count):
// RDD.count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
You'd have to write your own RDD actions (on a custom RDD) to get the required non-blocking behaviour in your Spark app.
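Alternatively, as a rough sketch of the non-blocking route mentioned above, SparkContext.submitJob can be called directly and returns a FutureAction (the RDD, the per-partition function and the printed message are made-up examples):

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// submitJob returns immediately with a FutureAction instead of blocking
val futureAction = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.sum,                                              // computed on each partition
  0 until rdd.getNumPartitions,                                               // partitions to evaluate
  (index: Int, partSum: Int) => println(s"partition $index sum = $partSum"),  // called as each partition finishes
  ()                                                                          // overall result once all partitions are done
)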
Upvotes: 38
Reputation: 195
With no further explanation, I assume that by a Spark job you mean a Spark action and all the tasks needed to evaluate that action. If that is the case, have a look at the official documentation (pay attention to your Spark version): https://spark.apache.org/docs/latest/job-scheduling.html
If that is not the route you want to follow and you just want a hackish, not-scalable, get-things-done approach, schedule the jobs for the same time in cron using spark-submit: https://spark.apache.org/docs/latest/submitting-applications.html
And of course, if you already have a scheduler available, it is simply a matter of looking into its documentation: Oozie, Airflow, Luigi, NiFi, and even old-school GUIs like Control-M support this.
Upvotes: -2