Reputation: 2635
I have an ETL job written in Python, which consists of multiple scripts with the following directory structure;
my_etl_job
|
|--services
| |
| |-- __init__.py
| |-- dynamoDB_service.py
|
|-- __init__.py
|-- main.py
|-- logger.py
main.py is the entrypoint script that imports the other scripts from the directories above. The code runs perfectly fine on a dev endpoint, after uploading it to the ETL cluster created by the dev endpoint. Now that I want to run it in production, I want to create a proper Glue job for it. But when I compress the whole my_etl_job directory into .zip format, upload it to the artifacts S3 bucket, and specify the .zip file's location as the script location as follows:
s3://<bucket_name>/etl_jobs/my_etl_job.zip
This is the code I see on the Glue job UI dashboard;
PK
���P__init__.pyUX�'�^"�^A��)PK#7�P logger.pyUX��^1��^A��)]�Mk�0����a�&v+���A�B���`x����q��} ...AND A LOT MORE...
It seems like the Glue job doesn't accept the .zip format? If so, what compression format should I use?
UPDATE:
I found that the Glue job has an option for taking in extra files, Referenced files path, where I provided a comma-separated list of the paths of all the above files, and changed the script location to refer only to the main.py file path. But that also didn't work. The Glue job throws the error no module found logger (and I defined this module inside the logger.py file).
Upvotes: 20
Views: 34637
Reputation: 911
I'm using Glue v2.0 with the Spark job type (rather than Python shell) and had a similar issue.
In addition to the previous answers regarding zip files, which advise that:
- main.py should not be zipped
- the .zip archive corelib.zip (or services.zip) should contain the corelib (or services) folder and its contents
I followed this and was still getting ImportError: No module named errors when trying to import my module.
After adding the following snippet to my Glue Job script:
import sys
import os
print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")
print(f"sys.path={sys.path}")
I could see that the current working directory contained my zip file, but sys.path did not include the current working directory. So Python was unable to import from my zip file, resulting in an ImportError: No module named error.
To resolve the import issue, I simply added the following code to my Glue Job script.
import sys
sys.path.insert(0, "utils.zip")
import utils
For reference, the contents of my utils.zip:
Archive: utils.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Defl:N 5 0% 01-01-2049 00:00 00000000 __init__.py
6603 Defl:N 1676 75% 01-01-2049 00:00 f4551ccb utils.py
-------- ------- --- -------
6603 1681 75% 2 files
(Note that __init__.py must be present for the module import to work.)
My local project folder structure
my_job_stuff
|-- utils
| |-- __init__.py
| |-- utils.py
|-- main.py
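Putting the pieces together, a minimal sketch of the top of such a Glue job script (the utils.do_something call is hypothetical, purely to illustrate using the zipped module):
import sys

# utils.zip is shipped alongside the job and lands in the working directory,
# but that directory is not on sys.path, so add the archive explicitly
sys.path.insert(0, "utils.zip")

import utils

# hypothetical call into the zipped module, for illustration only
result = utils.do_something()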
Upvotes: 8
Reputation: 1166
Try restructuring the project like this:
my_etl_job
|
|-- corelib
|   |
|   |-- __init__.py
|   |-- services
|   |   |
|   |   |-- __init__.py
|   |   |-- dynamoDB_service.py
|   |   |-- logger.py
|
|-- main.py
You can then import your dynamoDB_service module in main.py as corelib.services.dynamoDB_service. When you prepare your library, just go to the folder above corelib and zip up the folder like below:
zip -r corelib.zip corelib/
You can then add corelib.zip as an extra file in Glue. (You can prepare a wheel file too; it's your preference.)
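As a rough sketch, main.py could then use the zipped package like this (get_table_items is a made-up function name, just to show the call):
# main.py -- corelib.zip is supplied to the Glue job as an extra file/library
from corelib.services import dynamoDB_service

# get_table_items is a hypothetical function, for illustration only
items = dynamoDB_service.get_table_items("my_table")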
Upvotes: 2
Reputation: 176
You'll have to pass the zip file as an extra Python library, or build a wheel package for the code, upload the zip or wheel to S3, and provide that path in the extra Python library option.
Note: Have your main function written in the Glue console itself, referencing the required functions from the zipped/wheel dependency; your script location should never be a zip file.
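If you go the wheel route, a minimal setup.py sketch could look like this (the package name my_etl_job_lib and version are placeholders):
# setup.py -- minimal packaging sketch; name and version are placeholders
from setuptools import setup, find_packages

setup(
    name="my_etl_job_lib",
    version="0.1.0",
    packages=find_packages(),  # picks up packages that contain __init__.py
)
Building it with python setup.py bdist_wheel produces a .whl under dist/ that you can upload to S3 and reference via the extra Python library path (--extra-py-files).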
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
Upvotes: 13