Dipanshu Chaubey

Reputation: 890

ModuleNotFoundError: No module named 'psycopg2' [AWS Glue]

I am using AWS Glue and trying to use psycopg2 in a PySpark script. Since Glue does not bundle psycopg2 in its execution environment, I am passing it via --additional-python-modules, which is the documented way of installing additional Python modules in AWS Glue.

After following the steps mentioned in the docs, I am getting an error while running the job which says ModuleNotFoundError: No module named 'psycopg2._psycopg'


Here are the job parameters being passed during execution.

(screenshot of job parameters: --additional-python-modules set to psycopg2-binary)

Solutions I have already tried:

  1. Using psycopg2-binary
  2. Passing a zip/whl file in --additional-python-modules
  3. Changing glue version to 2.0 and 3.0

Upvotes: 3

Views: 1296

Answers (2)

RKelley

Reputation: 558

If you're running Glue 2.0 or later and getting this error despite passing the library to the Glue job via "--additional-python-modules", your issue may be that the job doesn't have the outbound internet access needed to reach PyPI. My job runs in a private subnet, so this was the problem for me.

If you have the same issue, you can either:

  1. Reconfigure your VPC to allow your Glue job to have outbound internet access.

  2. Pre-package it locally and upload it to S3. (I chose this one)

Here's what I did:

  1. Prepackage locally and upload to S3

    rm -f $(S3_ARTIFACTS)/psycopg2_binary.zip
    mkdir -p $(S3_ARTIFACTS)/psycopg2_binary
    pip install psycopg2-binary==2.9.9 -t $(S3_ARTIFACTS)/psycopg2_binary
    # zip in a subshell so the working directory is unchanged afterwards
    (cd $(S3_ARTIFACTS)/psycopg2_binary && zip -r ../psycopg2_binary.zip .)
    aws s3 cp $(S3_ARTIFACTS)/psycopg2_binary.zip s3://$(S3_ETL_BUCKET)/psycopg2_binary.zip
    rm -rf $(S3_ARTIFACTS)/psycopg2_binary
  2. In the console under "Python Library Path", put the S3 path to the uploaded archive (e.g. s3://$(S3_ETL_BUCKET)/psycopg2_binary.zip).

Note that if you still get ModuleNotFoundError: No module named 'psycopg2._psycopg', then the compiled C extension (_psycopg) is missing because the package you zipped wasn't built in an environment compatible with Glue. If that's the case, then:

  1. Use an Amazon Linux 2 Docker container to install and package psycopg2-binary.

  2. Create your zip package from within that container to ensure the compiled extensions are compatible with Glue.

For example, you can run:

rm -f $(S3_ARTIFACTS)/psycopg2_binary.zip
mkdir -p $(S3_ARTIFACTS)/psycopg2_binary
# Use an Amazon Linux 2 container to build the package
docker run --rm -v $(PWD)/$(S3_ARTIFACTS)/psycopg2_binary:/output amazonlinux:2 bash -c "\
  yum update -y && \
  yum install -y python3 python3-pip python3-devel gcc postgresql-devel libffi-devel zip && \
  pip3 install --no-cache-dir psycopg2-binary==2.9.9 -t /output"
# zip in a subshell so the working directory is unchanged afterwards
(cd $(S3_ARTIFACTS)/psycopg2_binary && zip -r ../psycopg2_binary.zip .)
aws s3 cp $(S3_ARTIFACTS)/psycopg2_binary.zip s3://$(S3_ETL_BUCKET)/psycopg2_binary.zip
rm -rf $(S3_ARTIFACTS)/psycopg2_binary
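Before uploading, it can be worth confirming that the compiled extension actually made it into the archive. A minimal check along these lines (the archive path and entry names are illustrative):

```python
import zipfile

def has_psycopg_extension(zip_path):
    """Return True if the archive contains psycopg2's compiled C extension.

    Glue's "No module named 'psycopg2._psycopg'" error means the
    psycopg2/_psycopg*.so file is missing from (or incompatible with)
    the package the job loaded.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return any(
            name.startswith("psycopg2/_psycopg") and name.endswith(".so")
            for name in zf.namelist()
        )
```

If this returns False for your zip, the pip install step pulled a package without the Linux binary, and the Docker build above is the fix.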

  3. Necessary if your Glue job runs in a VPC: add an S3 endpoint so the job can reach the bucket. Create an S3 gateway endpoint in your VPC and make sure the route table for the job's subnet is associated with it (or switch your Glue job to run in a subnet that has one).
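For that last step, the endpoint can be created with the AWS CLI; the VPC ID, route table ID, and region below are placeholders to substitute with your own values:

```shell
# Create an S3 gateway endpoint and associate the subnet's route table with it.
# vpc-0abc1234, rtb-0abc1234 and us-east-1 are hypothetical values.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234
```

A gateway endpoint (unlike an interface endpoint) adds a route to S3 directly in the route table, so no NAT gateway is needed for the S3 copy.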

This process worked for me.

Upvotes: 0

Frosty

Reputation: 698

As per the documentation, libraries not supported by Glue need to be hosted in S3 first, i.e. a .whl or .zip file can be uploaded to S3 and its path then passed in --additional-python-modules, like s3://<bucket-name>/<path-to-whl-zip-file>

From documentation: Dependencies must be hosted in Amazon S3 and the argument value should be a comma delimited list of Amazon S3 paths with no spaces.
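To illustrate that comma-delimited rule, the argument value can be assembled like this (bucket and file names are made up for the example):

```python
def build_additional_python_modules(paths):
    """Join S3 wheel/zip paths into the value Glue expects for
    --additional-python-modules: comma-delimited, with no spaces."""
    cleaned = [p.strip() for p in paths]
    for p in cleaned:
        if " " in p:
            raise ValueError(f"Glue rejects paths containing spaces: {p!r}")
    return ",".join(cleaned)

# Hypothetical bucket and keys:
print(build_additional_python_modules([
    "s3://my-etl-bucket/psycopg2_binary.zip",
    "s3://my-etl-bucket/wheels/extra-1.0-py3-none-any.whl",
]))
# -> s3://my-etl-bucket/psycopg2_binary.zip,s3://my-etl-bucket/wheels/extra-1.0-py3-none-any.whl
```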

As per the job parameters passed by you, it just mentions psycopg2-binary, which isn't actually installing any Python package, and hence you are getting the error.

Hope it helps.

Upvotes: 1
