Reputation: 9047
I am working on writing a framework that basically does a data sanity check. I have a set of inputs like
{
"check_1": [
sql_query_1,
sql_query_2
],
"check_2": [
sql_query_1,
sql_query_2
],
"check_3": [
sql_query_1,
sql_query_2
]
.
.
.
"check_100": [
sql_query_1,
sql_query_2
]
}
As you can see, there are 100 checks, and each check is comprised of at most 2 SQL queries. The idea is we get the data from the SQL queries and do some diff for data quality check.
Currently, I am running check_1, then check_2, and so on. Which is very slow. I tried to use joblib
library to parallelize the task but got an erroneous result. Also, come to know it is not a good idea to use multithreading in pyspark.
How can I achieve parallelism here? My idea is to
NOTE: Fair schedular is on in spark
Upvotes: 0
Views: 2771
Reputation: 29437
Just run each of the 100 checks as a separate Spark job and the fair scheduler should take care of sharing all available resources (memory/CPUs, by default memory) among jobs.
By default each queue bases fair sharing of resources based on memory (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Introduction):
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness
schedulingPolicy: to set the scheduling policy of any queue. The allowed values are “fifo”/“fair”/“drf” or any class that extends
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy
. Defaults to “fair”. If “fifo”, apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app’s requests.
On the other hand it should be possible to submit multiple jobs within a single application as long as each has its own thread. I assume one would use multiprocessing.
From Scheduling within an application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
See also How to run multiple jobs in one Sparkcontext from separate threads in PySpark?
Upvotes: 1