Madiha Khalid

Reputation: 380

PySpark Project Deployment on AWS Glue with Package dependency and External Library

I am working on building data pipelines in PySpark and expect to have 10-20 PySpark jobs. These jobs can share common libraries and packages. The structure of the PySpark project could look like this:


data-pipeline
    |__data                 # Extracted main data source directory in .json format
    |__dist                 # PySpark Package files in .zip
    |__jobs                 # PySpark Jobs  
        |__job_1              
        |__job_2            
        |__job_2            
        .......             
        .......             
    |__lib                  # Shared lib folder
    |__resources            # JAR Required for Spark Extra Path
    |__test                 # PySPARK jobs Tests
    |__config.json          # Spark, Jobs and DWH configuration for local 
    |__main.py              # Main Spark Driver Program
   

I want to use AWS Glue to deploy my Spark jobs. Is it possible to upload a Python project package to AWS Glue, just like we zip one for AWS EMR Step Function jobs?


My understanding is that AWS Glue runs a single PySpark script (.py file). Is there any other way to deploy the whole package and use application arguments to start the required job from AWS Step Functions? Keep in mind there might be some shared Python packages in the lib/ folder.

Am I missing something related to AWS Glue?

Upvotes: 1

Views: 675

Answers (1)

joseromerobarc

Reputation: 56

There are a couple of ways to do it. The easiest (since you mentioned the libraries you need are fairly common) is to add the libraries through a job parameter on the Glue job. For example, if you wanted to use the openai and langchain libraries, you could add them using Glue's built-in --additional-python-modules parameter:

enter image description here
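
As a rough sketch of the same setting done programmatically instead of in the console: the helper below builds the keyword arguments one could pass to boto3's Glue create_job call, joining the modules into the comma-separated value that --additional-python-modules expects. The helper name, job name, role ARN, S3 path, and Glue version are placeholders I made up for illustration, not values from the question.

```python
def build_create_job_kwargs(name, role_arn, script_s3_path, python_modules):
    """Build keyword arguments for boto3's glue.create_job (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumption: use whichever version you target
        "DefaultArguments": {
            # Comma-separated list of modules for Glue to pip-install at startup
            "--additional-python-modules": ",".join(python_modules),
        },
    }


# All values below are placeholders.
kwargs = build_create_job_kwargs(
    name="data-pipeline",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_s3_path="s3://my-bucket/data-pipeline/main.py",
    python_modules=["openai", "langchain"],
)
print(kwargs["DefaultArguments"])
```

With AWS credentials configured, something like boto3.client("glue").create_job(**kwargs) should create the job. Versions can be pinned in the same list using the usual package==version form.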

You can also package up a custom library, save it in S3, and reference that S3 path in the job as either a Python library or a JAR (see the screenshot above as well). Here's some documentation that I think might give more context:

Adding additional python libraries to glue jobs
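
For the packaged-project route, here is a minimal sketch of how main.py could pick which job to run from a job argument. It assumes the jobs/ and lib/ folders are zipped and attached to the Glue job (for example through the --extra-py-files job parameter), that each job module exposes a run() function, and that the argument is named --job_name. All three are my assumptions, not something taken from your project. It uses argparse rather than Glue's getResolvedOptions so the same code can run locally.

```python
import argparse
import importlib


def parse_job_name(argv):
    """Extract --job_name and ignore any other arguments Glue passes in."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--job_name", required=True)
    args, _unknown = parser.parse_known_args(argv)
    return args.job_name


def run_job(argv, package="jobs"):
    """Import <package>.<job_name> and call its run() function.

    Assumes every job module defines run(); that is a convention
    chosen for this sketch, not a Glue requirement.
    """
    job_name = parse_job_name(argv)
    module = importlib.import_module(f"{package}.{job_name}")
    return module.run()
```

In main.py you would then call run_job(sys.argv[1:]). When starting the job from Step Functions (or boto3's start_job_run), the job to run can be selected through the run arguments, e.g. Arguments={"--job_name": "job_1"}.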

Hope this helps!

Upvotes: 1
