AWS Glue: ModuleNotFoundError

Question

In my Glue script (Spark 3.1, Python 3, Glue 3) I'm trying to use the df.to_excel() function from the pandas library. Apparently pandas depends on openpyxl for this. My code is:

import sys
import boto3
import openpyxl
import pandas as pd

client = boto3.client('s3')
obj = client.get_object(Bucket = 'myBucketName', Key = 'myFileName.csv')

df = pd.read_csv(obj['Body'])

df.to_excel("output.xlsx", sheet_name='my-sheet-name')

The problem is that I get this error: ModuleNotFoundError: No module named 'openpyxl'

I found the links below, which explain how to add external Python libraries:
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library
Using Pandas AWS Glue Python Shell Jobs
Apparently I did something wrong, because it doesn't work for me. My steps are:

1. Create a setup.py file locally:

from setuptools import setup

setup(name="openpyxl", version="3.0.7", install_requires=['openpyxl'])

2. Run py setup.py develop in my local directory (I'm on Windows; my Python version is 3.9.7).
3. Run py setup.py bdist_egg in my local directory.
4. Copy the file ../dist/openpyxl-3.0.7-py3.9.egg into my S3 bucket.
5. In my Glue job, put the file location in Python library path.

What am I doing wrong? What am I missing?
Thanks in advance!

Coin Graham · Accepted Answer

In newer versions of Glue you can skip the egg/wheel approach and install modules at runtime. In the Job Parameters, add the key "--additional-python-modules" and set its value to a comma-separated list of modules, for example "openpyxl,pandas".

https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/
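If you start the job from code rather than the console, the same parameter can be passed as a job argument. The following is only a sketch: build_glue_arguments and start_job_with_modules are hypothetical helper names, the job name is a placeholder, and it assumes boto3 is installed and AWS credentials are configured wherever you run it.

```python
def build_glue_arguments(modules):
    """Build Glue job arguments asking for the given modules to be pip-installed at runtime.

    Hypothetical helper: the modules are joined into one comma-separated string.
    """
    cleaned = [m.strip() for m in modules if m.strip()]
    return {"--additional-python-modules": ",".join(cleaned)}


def start_job_with_modules(job_name, modules):
    """Start a Glue job run with the extra modules (needs boto3 and AWS credentials)."""
    # Imported here so build_glue_arguments can be used without the AWS SDK.
    import boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,  # placeholder: use the name of your own Glue job
        Arguments=build_glue_arguments(modules),
    )
    return response["JobRunId"]


if __name__ == "__main__":
    # Only prints the arguments; it does not call AWS.
    print(build_glue_arguments(["openpyxl", "pandas"]))
```

Arguments passed this way apply to that single run. To make them permanent, set the same key and value in the job's own Job Parameters, as described above.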
