Reputation: 1036
I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script?
I've tried from configuration import *, where the referenced file is named configuration.py, but no luck (ImportError: No module named configuration).
Upvotes: 5
Views: 8771
Reputation: 911
I had this issue with a Glue v2 Spark job, rather than a Python shell job, which the other answer discusses in detail.
The AWS documentation says that it is not necessary to zip a single .py file. However, I decided to use a .zip file anyway. My .zip file contains the following:
Archive: utils.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Defl:N 5 0% 01-01-2049 00:00 00000000 __init__.py
6603 Defl:N 1676 75% 01-01-2049 00:00 f4551ccb utils.py
-------- ------- --- -------
6603 1681 75% 2 files
Note that __init__.py is present and the archive is compressed using Deflate (the usual zip format).
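For reference, an archive with this layout can be produced with Python's zipfile module. A minimal sketch, assuming __init__.py and utils.py sit in the current directory (the file names come from the listing above):
import zipfile

# build utils.zip with Deflate compression, containing the package marker and the module
with zipfile.ZipFile("utils.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("__init__.py")  # empty file so the archive behaves like a package
    zf.write("utils.py")     # the module imported from the Glue job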
In my Glue Job, I added the referenced files path job parameter pointing to my zip file on S3.
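If you configure the job programmatically rather than in the console, the Referenced files path field should correspond to the --extra-files default argument. A minimal boto3 sketch, where the job name, role ARN, and S3 paths are placeholders rather than values from my setup:
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-spark-job",  # placeholder
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    Command={
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
    DefaultArguments={
        # equivalent of the Referenced files path field in the console
        "--extra-files": "s3://my-bucket/libs/utils.zip",
    },
)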
In the job script, I needed to explicitly add my zip file to the Python path before the import would work.
import sys
# the referenced zip is copied to the job's working directory, so a relative path works here
sys.path.insert(0, "utils.zip")
import utils
Failing to do the above resulted in an ImportError: No module named error.
For others who are struggling with this, inspecting the following variables helped me to debug the issue and arrive at the solution. Paste this into your Glue job and view the results in CloudWatch.
import sys
import os
print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")
print(f"sys.path={sys.path}")
Upvotes: 2
Reputation: 71
I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests in the meantime.
If you are using the referenced files path variable in a Python shell job, the referenced file is found in /tmp, where the Python shell job has no access by default. However, the same operation works successfully in a Spark job, because the file is found in the default file directory.
The code below finds the absolute path of sample_config.json, the file referenced in the Glue job configuration, and prints its contents.
import json
import os
import sys

def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    # search every directory on sys.path for the referenced file
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

with open(get_referenced_filepath('sample_config.json'), 'r') as f:
    data = json.load(f)

print(data)
The Boto3 API can be used to access the referenced file as well:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')
# iterate over the raw HTTP stream of the object body, line by line
for line in obj.get()['Body']._raw_stream:
    print(line)
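If the referenced file is JSON, the whole body can also be read and parsed in one step rather than streamed line by line. A minimal sketch, assuming the same bucket and key as above:
import json
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')
# read() returns the full object body as bytes, which json.loads can parse directly
config = json.loads(obj.get()['Body'].read())
print(config)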
Upvotes: 4