chris.mclennon

Reputation: 1036

How to import referenced files in ETL scripts?

I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script?

I've tried from configuration import *, where the referenced file name is configuration.py, but no luck (ImportError: No module named configuration).

Upvotes: 5

Views: 8771

Answers (2)

BjornO

Reputation: 911

I had this issue with a Glue v2 Spark job rather than with a Python shell job, which is what the other answer discusses in detail.

The AWS documentation says that it is not necessary to zip a single .py file. However, I decided to use a .zip file anyway.

My .zip file contains the following:

Archive:  utils.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Defl:N        5   0% 01-01-2049 00:00 00000000  __init__.py
    6603  Defl:N     1676  75% 01-01-2049 00:00 f4551ccb  utils.py
--------          -------  ---                            -------
    6603             1681  75%                            2 files

Note that __init__.py is present and that the archive is compressed with Deflate (the usual zip compression method).
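
If you want to build an equivalent archive programmatically, here is a minimal sketch using Python's zipfile module (file names taken from the listing above; ZIP_DEFLATED corresponds to the "Defl:N" method shown):

import zipfile

# Build utils.zip containing the two files from the listing above.
# ZIP_DEFLATED produces Deflate compression, the usual zip method.
with zipfile.ZipFile("utils.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("__init__.py")
    zf.write("utils.py")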

In my Glue job, I set the referenced files path job parameter to point to my zip file on S3.
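
For reference, here is a sketch of setting that parameter when defining the job with boto3. The job name, role ARN, and S3 paths below are hypothetical; as far as I know, the console's "Referenced files path" field maps to the --extra-files default argument:

import boto3

glue = boto3.client('glue')

# Hypothetical job definition; "Referenced files path" in the console
# corresponds to the --extra-files default argument.
glue.create_job(
    Name='my-spark-job',                               # hypothetical name
    Role='arn:aws:iam::123456789012:role/MyGlueRole',  # hypothetical role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/job.py',
    },
    GlueVersion='2.0',
    DefaultArguments={
        '--extra-files': 's3://my-bucket/lib/utils.zip',
    },
)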

In the job script, I needed to explicitly add my zip file to the Python path before the import would work.

import sys

# Glue copies the referenced zip into the job's working directory;
# putting it on sys.path lets Python import modules from inside it.
sys.path.insert(0, "utils.zip")

import utils

Failing to do the above resulted in an ImportError: No module named utils.

For others who are struggling with this, inspecting the following variables helped me debug the issue and arrive at the solution. Paste them into your Glue job and view the results in CloudWatch.

import sys
import os

print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")
print(f"sys.path={sys.path}")

Upvotes: 2

Ibrahim Iskin

Reputation: 71

I noticed the same issue. I believe there is already a ticket open to address it, but here is what AWS Support suggested in the meantime.

If you use the referenced files path variable in a Python shell job, the referenced file is placed in /tmp, which the Python shell job does not have access to by default. The same operation works in a Spark job, however, because there the file is placed in the default file directory.

The code below finds the absolute path of sample_config.json, the file referenced in the Glue job configuration, and prints its contents.

import json
import sys, os

def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    # Referenced files end up somewhere on sys.path, so search each
    # entry for the file.
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

with open(get_referenced_filepath('sample_config.json'), "r") as f:
    data = json.load(f)
    print(data)

The Boto3 API can be used to access the referenced file as well:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')
# iter_lines() streams the object line by line through the public
# API, instead of reaching into the private _raw_stream attribute.
for line in obj.get()['Body'].iter_lines():
    print(line)
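
If the file is small, a variant that reads the whole object in one call and parses it may be simpler (same hypothetical bucket and key as above):

import json

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')

# Read the whole object at once and parse it as JSON.
config = json.loads(obj.get()['Body'].read().decode('utf-8'))
print(config)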

Upvotes: 4
