Reputation: 997
I am running a PySpark job on a Spark 2.3 cluster with the following command.
spark-submit \
--deploy-mode cluster \
--master yarn \
--files ETLConfig.json \
PySpark_ETL_Job_v0.2.py
ETLConfig.json contains a parameter that is passed to the PySpark script. I am referring to this config JSON file in the main block as below:
import json
import os
from pyspark import SparkFiles

configFilePath = os.path.join(SparkFiles.getRootDirectory(), 'ETLConfig.json')
with open(configFilePath, 'r') as configFile:
    configDict = json.load(configFile)
But the job fails with the following error:
No such file or directory: u'/tmp/spark-7dbe9acd-8b02-403a-987d-3accfc881a98/userFiles-4df4-5460-bd9c-4946-b289-6433-drgs/ETLConfig.json'
May I know what's wrong with my script? I also tried SparkFiles.get(), but it didn't work either.
Upvotes: 2
Views: 10593
Reputation: 8523
You should be able to just load it from your PWD in the running driver. YARN starts the application master container (which hosts the driver in cluster mode) in the same folder where --files places the files. For client mode that would be different, but for cluster mode it should work fine. For example, this works for me:
driver.py
from pyspark import SparkContext, SparkFiles
import os

with SparkContext() as sc:
    print "PWD: " + os.getcwd()
    print "SparkFiles: " + SparkFiles.getRootDirectory()
    data = open('data.json')
    print "Success!"
spark-submit
spark-submit --deploy-mode cluster --master yarn --files data.json driver.py
Updated (comparing paths):
I updated my code to print both the PWD (which worked) and SparkFiles.getRootDirectory() (which didn't work). For some reason the paths differ; I'm not sure why, but loading files directly from the PWD is what I have always done to access files on the driver.
Here are the paths that were printed:
PWD: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/rwidmaier/appcache/application_1539970334177_0004/container_1539970334177_0004_01_000001
SparkFiles: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/rwidmaier/appcache/application_1539970334177_0004/spark-e869ac40-66b4-427e-a928-deef73b34e40/userFiles-a1d8e17f-b8a5-4999-8
Update #2
Apparently, the way it works is that --files and its brethren only guarantee to provide the files in the SparkFiles.get(..) folder on the executors, not on the driver. HOWEVER, in order to ship them to the executors, Spark first downloads them to the PWD on the driver, which allows you to access them from there.
The help text actually only mentions the executors, not the driver:
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
For reference, here is where the files are downloaded to the driver.
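Putting that together for the original question, here is a minimal sketch (assuming cluster deploy mode and --files ETLConfig.json; the job itself is hypothetical): the driver reads the config from its working directory, while executor-side code goes through SparkFiles.get():
import json
import os

from pyspark import SparkContext, SparkFiles

with SparkContext() as sc:
    # Driver side: in cluster mode, --files has already placed ETLConfig.json in the PWD.
    with open(os.path.join(os.getcwd(), 'ETLConfig.json')) as configFile:
        configDict = json.load(configFile)
    print(configDict)

    # Executor side: resolve the shipped copy via SparkFiles.get() inside the task.
    def read_on_executor(_):
        with open(SparkFiles.get('ETLConfig.json')) as f:
            return [json.load(f)]

    print(sc.parallelize([0], numSlices=1).mapPartitions(read_on_executor).collect())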
Upvotes: 3
Reputation: 92
You use cluster deploy mode. In this case the --files path refers not to a local path on the machine you use to submit, but to a local path on the worker that is used to spawn the driver, which is an arbitrary node in your cluster.
If you want to distribute files in cluster mode, you should store them in storage that can be accessed by every node. You can, for example, use an HDFS path or an HTTP/HTTPS/FTP URL, or a path that exists on every node.
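For instance (a hypothetical HDFS location; adjust the path to your cluster), you could upload the config to HDFS first and then reference it by URL so every node can fetch it:
hdfs dfs -put ETLConfig.json /user/etl/ETLConfig.json
spark-submit --deploy-mode cluster --master yarn --files hdfs:///user/etl/ETLConfig.json PySpark_ETL_Job_v0.2.py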
Upvotes: 5