Reputation: 543
In my Glue script (Spark 3.1, Python 3, Glue 3) I'm trying to use the df.to_excel()
function from the pandas library. Apparently pandas depends on openpyxl for this.
My code is:
import sys
import boto3
import openpyxl
import pandas as pd
client = boto3.client('s3')
obj = client.get_object(Bucket = 'myBucketName', Key = 'myFileName.csv')
df = pd.read_csv(obj['Body'])
df.to_excel("output.xlsx", sheet_name='my-sheet-name')
The issue is that I get this error: ModuleNotFoundError: No module named 'openpyxl'
I found the links below, which explain how to add external Python libraries:
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library
Using Pandas AWS Glue Python Shell Jobs
Apparently I did something wrong, because it doesn't work for me. My steps are:
Create setup.py file locally:
from setuptools import setup
setup( name="openpyxl", version="3.0.7", install_requires=['openpyxl'] )
Run py setup.py develop in my local directory
(I'm on Windows, my Python version is 3.9.7)
Run py setup.py bdist_egg in my local directory
Copy the file ../dist/openpyxl-3.0.7-py3.9.egg into my S3 bucket
In my Glue job, put the file's S3 location in Python library path
What am I doing wrong? What am I missing?
Thanks in advance!
Upvotes: 0
Views: 3865
Reputation: 1584
In newer versions of Glue you can skip the egg/wheel approach and install the modules at runtime. In the Job Parameters, add the key "--additional-python-modules" and set its value to "openpyxl, pandas".
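If you start the job from code rather than the console, the same parameter can be passed as a job argument. Below is a minimal sketch, assuming boto3 is available and your credentials allow starting the job. The helper names build_job_arguments and start_run are made up for illustration, and the job name is a placeholder.

```python
def build_job_arguments(modules):
    """Build Glue job arguments asking Glue to pip-install `modules` at startup."""
    # Strip stray whitespace and drop empty entries before joining with commas.
    cleaned = [m.strip() for m in modules if m.strip()]
    return {"--additional-python-modules": ",".join(cleaned)}


def start_run(job_name, modules):
    """Start a run of an existing Glue job with the extra modules requested."""
    # Imported here so build_job_arguments can be used without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,  # placeholder: the name of your existing Glue job
        Arguments=build_job_arguments(modules),
    )
    return response["JobRunId"]


if __name__ == "__main__":
    # Only prints the arguments; call start_run("my-glue-job", [...]) to launch.
    print(build_job_arguments(["openpyxl", "pandas"]))
```

Arguments passed to start_job_run apply to that run only. To make the setting permanent, put the same key and value in the job's default arguments (the Job Parameters section in the console).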
Upvotes: 2