Reputation: 890
I am using AWS Glue, where I'm trying to use psycopg2 in a PySpark script. Since Glue does not support psycopg2 in its execution environment, I am passing it in --additional-python-modules, which is the documented way of installing additional Python modules in AWS Glue.
After following the steps mentioned in the docs, I am getting an error while running the job: ModuleNotFoundError: No module named 'psycopg2._psycopg'
Here are the job parameters being passed during the execution.
Solutions I have already tried:
Upvotes: 3
Views: 1296
Reputation: 558
If you're running Glue 2.0 or later and getting this error despite passing the library to the Glue job via --additional-python-modules, your issue is potentially that your job doesn't have the necessary outbound internet access to reach PyPI. My job runs in a private subnet, so this was the problem for me.
If you have the same issue, you can either:
Reconfigure your VPC to allow your Glue job outbound internet access, or
Pre-package the library locally and upload it to S3 (I chose this option).
Here's what I did:
# Clean up any previous build and install the library into a staging directory
rm -f $(S3_ARTIFACTS)/psycopg2_binary.zip
mkdir -p $(S3_ARTIFACTS)/psycopg2_binary
pip install psycopg2-binary==2.9.9 -t $(S3_ARTIFACTS)/psycopg2_binary
# Zip the installed package contents (not the staging directory itself)
cd $(S3_ARTIFACTS)/psycopg2_binary && zip -r ../psycopg2_binary.zip .
# Upload the archive and remove the staging directory
aws s3 cp $(S3_ARTIFACTS)/psycopg2_binary.zip s3://$(S3_ETL_BUCKET)/psycopg2_binary.zip
rm -rf $(S3_ARTIFACTS)/psycopg2_binary
Then point --additional-python-modules at the uploaded archive (s3://$(S3_ETL_BUCKET)/psycopg2_binary.zip). Note that if you still get the ModuleNotFoundError: No module named 'psycopg2._psycopg', then the compiled C extension (_psycopg) is missing because the wheel you packaged wasn't built in an environment compatible with Glue. If that's the case, then:
Use an Amazon Linux 2 Docker container to install and package psycopg2-binary.
Create your zip package from within that container to ensure the compiled extensions are compatible with Glue.
For example, you can run:
rm -f $(S3_ARTIFACTS)/psycopg2_binary.zip
mkdir -p $(S3_ARTIFACTS)/psycopg2_binary
# Use an Amazon Linux 2 container to build the package
docker run --rm -v $(PWD)/$(S3_ARTIFACTS)/psycopg2_binary:/output amazonlinux:2 bash -c "\
yum update -y && \
yum install -y python3 python3-pip python3-devel gcc postgresql-devel libffi-devel zip && \
pip3 install --no-cache-dir psycopg2-binary==2.9.9 -t /output"
cd $(S3_ARTIFACTS)/psycopg2_binary && zip -r ../psycopg2_binary.zip .
cd $(S3_ARTIFACTS) && aws s3 cp psycopg2_binary.zip s3://$(S3_ETL_BUCKET)/psycopg2_binary.zip
rm -rf $(S3_ARTIFACTS)/psycopg2_binary
This process worked for me.
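Before uploading, you can also sanity-check that the archive actually contains the compiled extension. A minimal sketch (the has_native_psycopg helper and the archive path are illustrative, not part of the original answer):

```python
import zipfile

def has_native_psycopg(zip_path: str) -> bool:
    """Return True if the archive contains the compiled _psycopg extension
    (a psycopg2/_psycopg*.so member). Its absence is what produces
    'No module named psycopg2._psycopg' at import time in Glue."""
    with zipfile.ZipFile(zip_path) as zf:
        return any(
            name.startswith("psycopg2/_psycopg") and name.endswith(".so")
            for name in zf.namelist()
        )
```

Running has_native_psycopg("psycopg2_binary.zip") before the aws s3 cp step gives an early warning: if it returns False, the zip is missing the Linux-compatible shared object and the Glue job will fail on import.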
Upvotes: 0
Reputation: 698
As per the documentation, libraries not supported by Glue need to be present in an S3 path first, i.e. the .whl or .zip file can be uploaded to S3 and then the path passed in --additional-python-modules, like s3://<bucket-name>/<path-to-whl-zip-file>
From documentation: Dependencies must be hosted in Amazon S3 and the argument value should be a comma delimited list of Amazon S3 paths with no spaces.
As per the job parameters you passed, the value just mentions psycopg2-binary, which isn't really installing any Python package, and hence you are getting the error.
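For example, the job parameter would look something like the following (the bucket name and key are placeholders, not values from your job):

```json
{
  "--additional-python-modules": "s3://my-etl-bucket/psycopg2_binary.zip"
}
```

Multiple modules can be listed in that value as a comma-delimited list of S3 paths with no spaces, per the quoted documentation.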
Hope it helps.
Upvotes: 1