Praful

Reputation: 167

How to add an external library to a Glue job using Python shell

I tried to run a Glue Python shell job with external dependencies (pyathena, pytest, etc.) attached as Python .egg/.whl files in the job configuration, as described in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html.

The Glue job runs in a VPC with no internet access, and its execution failed with the following messages:

WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fd05d6a4f28>, 'Connection to pypi.org timed out. (connect timeout=15)')'

I also tried adding the following code to my Python script:

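import os
import site
import importlib
from setuptools.command import easy_install

# Directory where Glue expects additional libraries to be installed
install_path = os.environ['GLUE_INSTALLATION']

libraries = ["pyathena"]

# easy_install fetches each package from PyPI, so it needs internet access
for lib in libraries:
    easy_install.main(["--install-dir", install_path, lib])

# Reload site so the newly installed packages become importable
importlib.reload(site)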

Running the above code produced the following error:

Download error on https://pypi.org/simple/pyathena/: [Errno 99] Cannot assign requested address -- Some packages may not be found! Couldn't find index page for 'pyathena' (maybe misspelled?)

Could I get a sample code snippet that generates an egg/whl file for external Python packages and shows how to add them to a Glue Python shell job?

Upvotes: 1

Views: 4549

Answers (1)

Prabhakar Reddy

Reputation: 5144

Refer to this doc, which has detailed steps for packaging a Python library. Also make sure that your VPC has an S3 endpoint, because traffic does not leave the AWS network when a Glue job runs inside a VPC.
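As a rough illustration of the packaging step, here is a minimal sketch. It assumes you build the wheels on a machine that can reach PyPI (not inside the Glue VPC), upload them to S3 yourself, and then pass the S3 paths to the job through the `--extra-py-files` argument. The helper names `build_wheel_command`, `build_wheels` and `extra_py_files_value` are made up for this example, and so is the bucket name in the note below.

```python
import subprocess
import sys


def build_wheel_command(packages, wheel_dir):
    """Return the `pip wheel` command that builds wheels for `packages`
    (and their dependencies) into `wheel_dir`."""
    return [sys.executable, "-m", "pip", "wheel",
            "--wheel-dir", str(wheel_dir), *packages]


def build_wheels(packages, wheel_dir="wheels"):
    """Run pip on a machine that CAN reach PyPI (assumption: not the Glue job)."""
    subprocess.run(build_wheel_command(packages, wheel_dir), check=True)


def extra_py_files_value(wheel_names, s3_prefix):
    """Join uploaded wheel file names into the comma-separated list of
    S3 paths to use as the value of the job's --extra-py-files argument."""
    prefix = s3_prefix.rstrip("/")
    return ",".join(f"{prefix}/{name}" for name in wheel_names)
```

For example, you could call `build_wheels(["pyathena"])` locally, copy the resulting files with something like `aws s3 cp wheels/ s3://my-bucket/glue-libs/ --recursive`, and set the job's Python library path to the string returned by `extra_py_files_value(...)`. Since `pip wheel` also builds wheels for dependencies, uploading all of them may keep pip in the job from trying to reach pypi.org. Wheels with compiled code are platform-specific, so pure-Python wheels are the safest choice here.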

Upvotes: 1
