Reputation: 23
I am trying to zip up my application from within a test file so that I can spark-submit it to an EMR cluster. The setup looks like this:
Folder structure of modules:
app
--- module1
------ test.py
------ test2.py
--- module2
------ file1.py
------ file2.py
Zip function I'm calling from my tests:
import zipfile
import os

def zip_deps():
    # make zip
    module1_path = '../module1'
    module2_path = '../module2'
    try:
        with zipfile.ZipFile('deps.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
            info = zipfile.ZipInfo(module1_path + '/')
            zipf.writestr(info, '')
            for root, dirs, files in os.walk(module1_path):
                for d in dirs:
                    info = zipfile.ZipInfo(os.path.join(root, d) + '/')
                    zipf.writestr(info, '')
                for file in files:
                    zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file)))
            info = zipfile.ZipInfo(module2_path + '/')
            zipf.writestr(info, '')
            for root, dirs, files in os.walk(module2_path):
                for d in dirs:
                    info = zipfile.ZipInfo(os.path.join(root, d) + '/')
                    zipf.writestr(info, '')
                for file in files:
                    zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file)))
    except:
        print('Unexpected error occurred while creating file deps.zip')
        zipf.close()
deps.zip is created and, as far as I can see, it contains all the files I want, with each module folder at the base level of the zip.
In fact, a zip created from the command line using:
zip -r deps.zip module1 module2
has the same structure, and THIS one works when I spark-submit it with
spark-submit --py-files deps.zip driver.py
Error from EMR when I submit the zip created in Python:
Traceback (most recent call last):
File "driver.py", line 6, in <module>
from module1.test import test_function
ModuleNotFoundError: No module named 'module1'
FWIW, I also tried zipping by shelling out with the following commands, and I got the same error from Spark on EMR:
os.system("zip -r9 deps.zip ../module1")
os.system("zip -r9 deps.zip ../module2")
I don't know why a zip file created in python would be different than outside of python, but I've spent the last few days on this and hopefully someone can help!
Thanks!!
Upvotes: 1
Views: 1683
Reputation: 23
It turns out it was something fairly simple...
zipfile was storing each file name with its relative directory prefix, such as:
../module1/test.py
Spark is expecting the folders to be at the top level, without that relative path, like:
module1/test.py
I just had to change my write to be like this:
with zipfile.ZipFile('deps.zip', 'w') as zipf:
    for file in file_paths:
        zipf.write(file, os.path.relpath(file, '..'))
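For anyone who wants the whole thing in one place, here is a minimal sketch of the same idea as a complete function. The name zip_modules and its parameters (base_dir, module_names, zip_path) are made up for illustration and are not from my original code; the only part that matters is that the archive name is computed relative to the folder that contains the modules.

```python
import os
import zipfile


def zip_modules(base_dir, module_names, zip_path):
    """Zip each module folder so that it sits at the top level of the archive.

    base_dir, module_names and zip_path are hypothetical parameters used
    for this sketch only.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for name in module_names:
            module_dir = os.path.join(base_dir, name)
            for root, _dirs, files in os.walk(module_dir):
                for file_name in files:
                    full_path = os.path.join(root, file_name)
                    # Store the name relative to base_dir, so the entry is
                    # module1/test.py instead of ../module1/test.py.
                    arcname = os.path.relpath(full_path, base_dir)
                    zipf.write(full_path, arcname)
```

With the folder layout from the question, and assuming the tests run from inside module1, the call would be something like zip_modules('..', ['module1', 'module2'], 'deps.zip').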
If you extract the original zip file, you'd never see the names with the ../ in front. Shrug
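Since extracting hides the problem, one way to catch it is to list the names stored in the archive instead. A rough sketch, assuming you only care about entries that start with ../ or /, or that contain a /../ segment; the helper name suspicious_names is hypothetical:

```python
import zipfile


def suspicious_names(zip_path):
    """Return stored member names that are not clean relative paths."""
    with zipfile.ZipFile(zip_path) as zipf:
        return [
            name
            for name in zipf.namelist()
            # namelist() returns the names exactly as stored in the archive
            if name.startswith(("../", "/")) or "/../" in name
        ]
```

An empty list from suspicious_names('deps.zip') means every entry is stored under a plain path such as module1/test.py.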
Upvotes: 1