Reputation: 15330
According to the docs
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them
And
The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either putting a file named fairscheduler.xml on the classpath, or setting spark.scheduler.allocation.file property in your SparkConf
So I can do the first part easily enough:
from pyspark import SparkConf, SparkContext

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=__sp_conf)
sc.setLocalProperty("spark.scheduler.pool", "default")
But how do I get an xml file called fairscheduler.xml
onto the classpath? Also, the classpath of what? Just the driver? Every executor?
I've tried using the addFile() function on SparkContext, but that's really for making files accessible to your jobs; I don't think it adds anything to the classpath.
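For reference, this is roughly what I tried (the path here is just for illustration):

# Ships the file to every node, but doesn't touch any classpath.
sc.addFile("/path/to/fairscheduler.xml")

# Jobs can then read it via SparkFiles, which isn't what the scheduler
# is looking for:
from pyspark import SparkFiles
local_path = SparkFiles.get("fairscheduler.xml")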
My other thought was modifying the PYSPARK_SUBMIT_ARGS environment variable to mess around with the command sent to spark-submit, but I'm not sure there's a way to alter the classpath using that method. Additionally, that would only alter the classpath of the driver, not every executor, and I'm not sure whether that would even work.
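For the record, this is the kind of thing I was considering (the path is illustrative, and the variable has to be set before the SparkContext is created):

import os

# Must be set before the SparkContext starts; the string must end
# in "pyspark-shell" for PySpark to accept it.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--files /path/to/fairscheduler.xml pyspark-shell"
)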
To be clear, if I don't provide the fairscheduler.xml file, Spark complains:
WARN FairSchedulableBuilder:66 - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
Upvotes: 4
Views: 7043
Reputation: 29185
Question: But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?
The points below can help in this case, depending on the mode in which you are submitting the job. Here I am trying to list them all:
To use the Fair Scheduler, first assign the appropriate scheduler class in yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
Your __sp_conf.set approach works, or simply the following via spark-defaults.conf (an equivalent programmatic sketch follows the conf file):
sudo vim /etc/spark/conf/spark-defaults.conf

spark.master                    yarn
...
spark.yarn.dist.files           /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml
spark.scheduler.mode            FAIR
spark.scheduler.allocation.file fairscheduler.xml
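For completeness, a sketch of the same two Spark settings applied programmatically from PySpark. The absolute path is an assumption for client/local mode; in cluster mode the file shipped via spark.yarn.dist.files lands in the container's working directory, which is why the bare file name works there:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.scheduler.mode", "FAIR")
# Adjust this path to wherever fairscheduler.xml lives on the driver.
conf.set("spark.scheduler.allocation.file", "/home/hadoop/fairscheduler.xml")
sc = SparkContext(conf=conf)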
Copy fairscheduler.xml to /home/hadoop/fairscheduler.xml:
<?xml version="1.0"?>
<allocations>
  <pool name="sparkmodule1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="sparkmodule2">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
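To route jobs into those pools from PySpark, set the pool as a local property on the thread that submits them; a sketch, assuming sc is the SparkContext created earlier:

import threading

def module1_jobs():
    # Thread-local: only jobs submitted from this thread use the pool.
    sc.setLocalProperty("spark.scheduler.pool", "sparkmodule1")
    sc.parallelize(range(100)).count()

def module2_jobs():
    sc.setLocalProperty("spark.scheduler.pool", "sparkmodule2")
    sc.parallelize(range(100)).count()

threading.Thread(target=module1_jobs).start()
threading.Thread(target=module2_jobs).start()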
where sparkmodule1, sparkmodule2, ... are the modules for which you want to create a dedicated pool of resources.
Note: you don't need to set the default pool explicitly with sc.setLocalProperty("spark.scheduler.pool", "default"); if a job matches no pool in your fairscheduler.xml, it goes into the default pool naturally.
A sample spark-submit when you are in cluster mode (main class and application jar shown as placeholders):

spark-submit --name "jobname" --class <main-class> --master yarn --deploy-mode cluster --files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml <application-jar>
Note: in client mode, if you want to run spark-submit from a directory other than the home directory, create a symlink so that fairscheduler.xml resolves from the directory you are executing from, for example a scripts folder:

ln -s /home/hadoop/fairscheduler.xml fairscheduler.xml
Note: if you don't want to copy fairscheduler.xml to the /home/hadoop folder, you can create it as /etc/spark/conf/fairscheduler.xml and symlink it into the directory you are executing spark-submit from, as described above.
References: Spark Fair scheduler example
To cross-verify: the Environment tab in the Spark UI displays the values of the environment and configuration variables, including Java, Spark, and system properties. The fair scheduler allocation file path will appear there.
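You can also cross-check from code; a small sketch, assuming sc is the running SparkContext:

# Print the scheduler-related settings the context actually picked up.
for key, value in sc.getConf().getAll():
    if key.startswith("spark.scheduler"):
        print(key, "=", value)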
Upvotes: 2
Reputation: 36
The steps to take are walked through in the reference below.
REFERENCE: Spark Continuous Application with FAIR Scheduler presentation, https://www.youtube.com/watch?v=oXwOQKXo9VE
Upvotes: 0