awenclaw

Reputation: 543

AWS Glue: ModuleNotFoundError

In my Glue script (Spark 3.1, Python 3, Glue 3) I'm trying to use the df.to_excel() function from the pandas library. pandas depends on openpyxl to write .xlsx files. My code is:

import sys
import boto3
import openpyxl
import pandas as pd

client = boto3.client('s3')
obj = client.get_object(Bucket='myBucketName', Key='myFileName.csv')

df = pd.read_csv(obj['Body'])

df.to_excel("output.xlsx", sheet_name='my-sheet-name')

The problem is that the job fails with this error: ModuleNotFoundError: No module named 'openpyxl'

I found the links below, which explain how to add external Python libraries:
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library
Using Pandas AWS Glue Python Shell Jobs
Apparently I did something wrong, because it doesn't work for me. My steps are:

  1. Create a setup.py file locally:

    from setuptools import setup

    setup( name="openpyxl", version="3.0.7", install_requires=['openpyxl'] )

  2. Run py setup.py develop in my local directory (I'm on Windows; my Python version is 3.9.7)

  3. Run py setup.py bdist_egg in my local directory

  4. Copy the file ../dist/openpyxl-3.0.7-py3.9.egg into my S3 bucket

  5. In my Glue job, put the file's S3 location in the Python library path


What am I doing wrong? What am I missing?
Thanks in advance!

Upvotes: 0

Views: 3865

Answers (1)

Coin Graham

Reputation: 1584

In newer versions of Glue (2.0 and later) you can skip the egg/wheel approach and install modules at runtime. In the Job Parameters, add the key "--additional-python-modules" and set its value to a comma-separated list of modules: "openpyxl, pandas".

https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/
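
If you start the job from code rather than from the console, the same parameter can be passed per run. Here is a minimal sketch, assuming boto3 is available and the job already exists. The job name "my-glue-job" and the helper names build_job_arguments and start_job are made up for illustration and are not part of any AWS API:

```python
def build_job_arguments(modules):
    """Build the job arguments that ask Glue to pip-install `modules` at startup."""
    # Glue takes the modules as one comma-separated string; stray spaces are stripped here.
    cleaned = [m.strip() for m in modules if m.strip()]
    return {"--additional-python-modules": ",".join(cleaned)}


def start_job(job_name, modules):
    """Start a run of an existing Glue job (needs AWS credentials and permissions)."""
    import boto3  # imported lazily so build_job_arguments works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,  # hypothetical name, e.g. "my-glue-job"
        Arguments=build_job_arguments(modules),
    )
    return response["JobRunId"]


if __name__ == "__main__":
    # Only prints the arguments; call start_job("my-glue-job", [...]) to launch a run.
    print(build_job_arguments(["openpyxl", "pandas"]))
```

Arguments passed to start_job_run apply to that run only. To make the setting permanent, put the same key and value in the job's default parameters, as described above.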

Upvotes: 2
