Reputation: 997
I am running a PySpark job on a Spark 2.3 cluster with the following command.
spark-submit \
--deploy-mode cluster \
--master yarn \
--files ETLConfig.json \
PySpark_ETL_Job_v0.2.py
ETLConfig.json contains a parameter that is passed to the PySpark script. I am referring to this config JSON file in the main block as below:
import json
import os
from pyspark import SparkFiles

configFilePath = os.path.join(SparkFiles.getRootDirectory(), 'ETLConfig.json')
with open(configFilePath, 'r') as configFile:
    configDict = json.load(configFile)
But the job fails with the following error:
No such file or directory: u'/tmp/spark-7dbe9acd-8b02-403a-987d-3accfc881a98/userFiles-4df4-5460-bd9c-4946-b289-6433-drgs/ETLConfig.json'
May I know what's wrong with my script? I also tried SparkFiles.get(), but it didn't work either.
Upvotes: 2
Views: 10593
Reputation: 8523
You should be able to just load it from your PWD in the running driver. YARN starts the application master container (which hosts the driver in cluster mode) in the same folder where --files places the files. For client mode that would be different, but for cluster mode it should work fine. For example, this works for me:
driver.py
from pyspark import SparkContext, SparkFiles
import os

with SparkContext() as sc:
    print "PWD: " + os.getcwd()
    print "SparkFiles: " + SparkFiles.getRootDirectory()
    data = open('data.json')
    print "Success!"
spark-submit
spark-submit --deploy-mode cluster --master yarn --files data.json driver.py
Updated (comparing paths):
I updated my code to print both the PWD (which worked) and SparkFiles.getRootDirectory() (which didn't work). For some reason the paths differ; I'm not sure why, but loading files directly from the PWD is what I have always done to access files on the driver.
Here are the paths that were printed:
PWD: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/rwidmaier/appcache/application_1539970334177_0004/container_1539970334177_0004_01_000001
SparkFiles: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/rwidmaier/appcache/application_1539970334177_0004/spark-e869ac40-66b4-427e-a928-deef73b34e40/userFiles-a1d8e17f-b8a5-4999-8
Update #2
Apparently, the way it works is that --files and its brethren only guarantee to provide the files in the SparkFiles.get(..) folder on the executors, not on the driver. HOWEVER, in order to ship them to the executors, Spark first downloads them to the PWD on the driver, which allows you to access them from there.
The help text actually only mentions the executors, not the driver:
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
For reference, here is where the files are downloaded to the driver.
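Putting that together for the original question, here is a minimal sketch (assuming cluster deploy mode and --files ETLConfig.json; the job itself is hypothetical): the driver reads the config from its working directory, while executor-side code goes through SparkFiles.get():
import json
import os

from pyspark import SparkContext, SparkFiles

with SparkContext() as sc:
    # Driver side: in cluster mode, --files has already placed ETLConfig.json in the PWD.
    with open(os.path.join(os.getcwd(), 'ETLConfig.json')) as configFile:
        configDict = json.load(configFile)
    print(configDict)

    # Executor side: resolve the shipped copy via SparkFiles.get() inside the task.
    def read_on_executor(_):
        with open(SparkFiles.get('ETLConfig.json')) as f:
            return [json.load(f)]

    print(sc.parallelize([0], numSlices=1).mapPartitions(read_on_executor).collect())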
Upvotes: 3
Reputation: 92
You use cluster deploy mode. In this case the --files path refers not to a local path on the machine you use to submit, but to a local path on the worker that is used to spawn the driver, which is an arbitrary node in your cluster.
If you want to distribute files in cluster mode, you should store them in storage that can be accessed by every node. You can, for example, use an HDFS path or an HTTP/HTTPS/FTP URL, or a path that exists on every node.
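For instance (a hypothetical HDFS location; adjust the path to your cluster), you could upload the config to HDFS first and then reference it by URL so every node can fetch it:
hdfs dfs -put ETLConfig.json /user/etl/ETLConfig.json
spark-submit --deploy-mode cluster --master yarn --files hdfs:///user/etl/ETLConfig.json PySpark_ETL_Job_v0.2.py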
Upvotes: 5