I have the following folder structure. I zipped the source folder and ran spark-submit with source.zip passed as --py-files. My problem is: how do I read the config.hcl file from the PySpark application? I tried SparkFiles.getRootDirectory() + '/source/config.hcl', but that didn't work; the error says "no such file or directory". Many thanks for your help. I'm trying to read config.hcl from app.py. How do I refer to it inside a zip? Or how do I unzip it first?
source
| config.hcl
| app.py
Upvotes: 2
Views: 1486
There are two main reasons you weren't able to read the config.hcl file:
1. With --py-files, the package remains inside the zip file without being extracted (for example /private/var/folders/81/c3fgx2qx6nq3lh2v983cdcd80000gn/T/spark-043999a0-c7fb-409c-a95d-4b8a902e55f0/userFiles-c3301b1a-b47e-4411-a2e9-ef0d8c2dc347/a.zip).
2. Reading config.hcl from there is a bit different: you need to read it through ZipFile instead.
I created a test file a.zip with the following structure:
├── a
│ ├── __init__.py
│ ├── a.py
│ └── a.txt
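# __init__.py
from .a import *

# a.py
from os import path
from zipfile import ZipFile

def test():
    # __file__ is <...>/a.zip/a/a.py, so two dirname() calls give the path of a.zip itself
    archive_path = path.dirname(path.dirname(path.abspath(__file__)))
    with ZipFile(archive_path) as archive:
        with archive.open('a/a.txt') as f:
            # the member is opened in binary mode, so decode before printing
            print(f.read().decode('utf-8'))

# a.txt
'Hello World'

# spark-submit --py-files a.zip ...
import a
a.test()
# 'Hello World'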
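Applied to the layout in the question, the same idea could look roughly like the sketch below. The helper name read_text_from_enclosing_zip is hypothetical, and the member name 'source/config.hcl' assumes the archive was created so that it keeps the top-level source folder. It also assumes app.py is imported from the zip; if app.py is passed to spark-submit as the main script, __file__ will probably not point inside the archive and this will not find it.

```python
import os
import zipfile


def read_text_from_enclosing_zip(module_file, member, encoding="utf-8"):
    """Read `member` from the zip archive that contains `module_file`.

    Hypothetical helper: walks up from a path such as
    <...>/source.zip/source/app.py until it reaches a real zip file,
    so the nesting depth of the module does not need to be hard-coded.
    """
    current = os.path.abspath(module_file)
    while True:
        if os.path.isfile(current) and zipfile.is_zipfile(current):
            with zipfile.ZipFile(current) as archive:
                return archive.read(member).decode(encoding)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without finding a zip
            raise FileNotFoundError(f"no zip archive found above {module_file}")
        current = parent


# Inside source/app.py (assumed member name, see the note above):
# config_text = read_text_from_enclosing_zip(__file__, "source/config.hcl")
```

The function returns the raw text of the file; parsing the HCL content is a separate step and is left out here.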
Upvotes: 1