Reputation: 15330
According to the docs
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them
And
The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either putting a file named fairscheduler.xml on the classpath, or setting spark.scheduler.allocation.file property in your SparkConf
So I can do the first part easily enough:
from pyspark import SparkConf, SparkContext

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=__sp_conf)
sc.setLocalProperty("spark.scheduler.pool", "default")
But how do I get an xml file called fairscheduler.xml
onto the classpath? Also, the classpath of what? Just the driver? Every executor?
I've tried using the addFile() function on SparkContext, but that's really for making files accessible to your jobs; I don't think it adds anything to the classpath.
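For reference, this is roughly what I tried (the path here is just for illustration):

# Ships the file to every node, but doesn't touch any classpath.
sc.addFile("/path/to/fairscheduler.xml")

# Jobs can then read it via SparkFiles, which isn't what the scheduler
# is looking for:
from pyspark import SparkFiles
local_path = SparkFiles.get("fairscheduler.xml")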
My other thought was modifying the PYSPARK_SUBMIT_ARGS environment variable to mess around with the command sent to spark-submit, but I'm not sure there's a way to alter the classpath using that method. Additionally, that would only alter the classpath of the driver, not every executor, and I'm not sure whether that would even work.
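For the record, this is the kind of thing I was considering (the path is illustrative, and the variable has to be set before the SparkContext is created):

import os

# Must be set before the SparkContext starts; the string must end
# in "pyspark-shell" for PySpark to accept it.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--files /path/to/fairscheduler.xml pyspark-shell"
)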
To be clear, if I don't provide the fairscheduler.xml file, Spark complains:
WARN FairSchedulableBuilder:66 - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
Upvotes: 4
Views: 7043
Reputation: 29185
Question: But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?
The points below can help in this case, depending on the mode in which you are submitting the job. Here I am trying to list them all:
To use the Fair Scheduler, first assign the appropriate scheduler class in yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
Your __sp_conf.set approach works, or simply the following via spark-defaults.conf (an equivalent programmatic sketch follows the conf file):
sudo vim /etc/spark/conf/spark-defaults.conf

spark.master                    yarn
...
spark.yarn.dist.files           /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml
spark.scheduler.mode            FAIR
spark.scheduler.allocation.file fairscheduler.xml
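For completeness, a sketch of the same two Spark settings applied programmatically from PySpark. The absolute path is an assumption for client/local mode; in cluster mode the file shipped via spark.yarn.dist.files lands in the container's working directory, which is why the bare file name works there:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.scheduler.mode", "FAIR")
# Adjust this path to wherever fairscheduler.xml lives on the driver.
conf.set("spark.scheduler.allocation.file", "/home/hadoop/fairscheduler.xml")
sc = SparkContext(conf=conf)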
Copy fairscheduler.xml to /home/hadoop/fairscheduler.xml:
<?xml version="1.0"?>
<allocations>
  <pool name="sparkmodule1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="sparkmodule2">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
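To route jobs into those pools from PySpark, set the pool as a local property on the thread that submits them; a sketch, assuming sc is the SparkContext created earlier:

import threading

def module1_jobs():
    # Thread-local: only jobs submitted from this thread use the pool.
    sc.setLocalProperty("spark.scheduler.pool", "sparkmodule1")
    sc.parallelize(range(100)).count()

def module2_jobs():
    sc.setLocalProperty("spark.scheduler.pool", "sparkmodule2")
    sc.parallelize(range(100)).count()

threading.Thread(target=module1_jobs).start()
threading.Thread(target=module2_jobs).start()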
where sparkmodule1, sparkmodule2, ... are the modules for which you want to create a dedicated pool of resources.
Note: you don't need to set the default pool explicitly with sc.setLocalProperty("spark.scheduler.pool", "default"); if a job matches no pool in your fairscheduler.xml, it goes into the default pool naturally.
A sample spark-submit when you are in cluster mode (main class and application jar shown as placeholders):

spark-submit --name "jobname" --class <main-class> --master yarn --deploy-mode cluster --files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml <application-jar>
Note: in client mode, if you want to run spark-submit from a directory other than the home directory, create a symlink so that fairscheduler.xml resolves from the directory you are executing from, for example a scripts folder:

ln -s /home/hadoop/fairscheduler.xml fairscheduler.xml
Note: if you don't want to copy fairscheduler.xml to the /home/hadoop folder, you can create it as /etc/spark/conf/fairscheduler.xml and symlink it into the directory you are executing spark-submit from, as described above.
References: Spark Fair scheduler example
To cross-verify: the Environment tab in the Spark UI displays the values of the environment and configuration variables, including Java, Spark, and system properties. The fair scheduler allocation file path will appear there.
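You can also cross-check from code; a small sketch, assuming sc is the running SparkContext:

# Print the scheduler-related settings the context actually picked up.
for key, value in sc.getConf().getAll():
    if key.startswith("spark.scheduler"):
        print(key, "=", value)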
Upvotes: 2
Reputation: 36
The steps to take are walked through in the reference below.
REFERENCE: Spark Continuous Application with FAIR Scheduler presentation, https://www.youtube.com/watch?v=oXwOQKXo9VE
Upvotes: 0