user4046073

Reputation: 871

How to read a file from a zip submitted via --py-files in a PySpark application

I have the following folder structure. I zipped the source folder and ran spark-submit with source.zip passed as --py-files. How do I read config.hcl from app.py in the PySpark application? I tried SparkFiles.getRootDirectory() + '/source/config.hcl', but that fails with a "no such file or directory" error. How do I refer to a file inside the zip, or do I need to unzip it first? Many thanks for your help.

source
  | config.hcl
app.py

Upvotes: 2

Views: 1486

Answers (1)

pltc

Reputation: 6082

There are two main reasons you weren't able to read the config.hcl file:

  1. When you upload a zip file and submit it via --py-files, the package stays inside the zip file and is not extracted (for example /private/var/folders/81/c3fgx2qx6nq3lh2v983cdcd80000gn/T/spark-043999a0-c7fb-409c-a95d-4b8a902e55f0/userFiles-c3301b1a-b47e-4411-a2e9-ef0d8c2dc347/a.zip)
  2. Because of (1), config.hcl is not a regular file on disk, so you need to read it through ZipFile instead of a plain file path
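To make the two points above concrete, here is a minimal sketch that runs without Spark. It builds a throwaway source.zip in a temporary directory (the function name and the config contents are made up for illustration) and compares a plain path lookup with a lookup through ZipFile.

```python
import os
import tempfile
from zipfile import ZipFile


def demo_zip_member_is_not_a_file():
    """Build a throwaway source.zip and compare two ways of reaching config.hcl."""
    tmp_dir = tempfile.mkdtemp()
    zip_path = os.path.join(tmp_dir, "source.zip")
    with ZipFile(zip_path, "w") as archive:
        # Hypothetical config contents, only here so the archive has a member.
        archive.writestr("source/config.hcl", 'name = "demo"\n')

    # Treating the archive as a directory fails: the member is not on the filesystem.
    as_plain_path = os.path.exists(os.path.join(zip_path, "source", "config.hcl"))

    # Opening the archive itself and asking for the member by name works.
    with ZipFile(zip_path) as archive:
        as_zip_member = "source/config.hcl" in archive.namelist()

    return as_plain_path, as_zip_member


print(demo_zip_member_is_not_a_file())  # (False, True)
```

The first lookup is the same kind of path the question builds from SparkFiles.getRootDirectory(), which is why it ends in "no such file or directory".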

I created a test file a.zip with the following structure

├── a
│   ├── __init__.py
│   ├── a.py
│   └── a.txt
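
# a/__init__.py
from .a import *

# a/a.py
from os import path
from zipfile import ZipFile

def test():
    # __file__ is <...>/a.zip/a/a.py, so two dirname() calls give the path of a.zip itself
    zip_path = path.dirname(path.dirname(path.abspath(__file__)))
    with ZipFile(zip_path) as archive:
        # Members are addressed by their path inside the archive
        with archive.open('a/a.txt') as f:
            print(f.read().decode('utf-8'))

# a/a.txt
'Hello World'

# driver script, run with: spark-submit --py-files a.zip ...
import a
a.test()
# prints the contents of a.txt: 'Hello World'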
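
The same idea could be applied to the layout in the question. The following is only a sketch under assumptions: read_from_enclosing_zip is a hypothetical helper (not part of PySpark), and it assumes the calling module was imported from inside the zip, so that its __file__ points into the archive. Instead of hard-coding two dirname() calls, it walks up the path until it reaches a zip file.

```python
import os
from zipfile import ZipFile, is_zipfile


def read_from_enclosing_zip(module_file, member):
    """Walk up from module_file until a zip archive is found, then read member from it.

    module_file: typically __file__ of a module imported from the zip (assumption).
    member: path of the wanted file inside the archive, e.g. 'source/config.hcl'.
    """
    current = os.path.abspath(module_file)
    while True:
        if os.path.isfile(current) and is_zipfile(current):
            with ZipFile(current) as archive:
                return archive.read(member).decode("utf-8")
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding an archive.
            raise FileNotFoundError(f"no zip archive found above {module_file}")
        current = parent
```

From a module inside source.zip, the call would look like read_from_enclosing_zip(__file__, 'source/config.hcl'). If app.py is the main script and lives outside the zip, its __file__ does not point into the archive, and the path to source.zip would have to be obtained some other way.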

Upvotes: 1
