Reputation: 380
I am working on building data pipelines in PySpark. I expect to have 10-20 PySpark jobs, and these jobs can share common libraries and packages. The structure of the PySpark project could look like this:
data-pipeline
|__ data         # Extracted main data source directory in .json format
|__ dist         # PySpark package files in .zip
|__ jobs         # PySpark jobs
|   |__ job_1
|   |__ job_2
|   |__ job_3
|   ...
|__ lib          # Shared lib folder
|__ resources    # JARs required for the Spark extra classpath
|__ test         # PySpark job tests
|__ config.json  # Spark, jobs and DWH configuration for local runs
|__ main.py      # Main Spark driver program
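The main.py driver is meant to select a job at runtime from a command-line argument, roughly along these lines (simplified sketch; the run(spark, config) entry point per job is just my convention):

```python
# main.py -- simplified sketch of the driver (argument and entry-point names are illustrative)
import argparse
import importlib
import json

from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--job", required=True, help="Job module under jobs/, e.g. job_1")
    parser.add_argument("--config", default="config.json", help="Path to local configuration")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    spark = SparkSession.builder.appName(args.job).getOrCreate()

    # Each job package exposes a run(spark, config) entry point
    job_module = importlib.import_module(f"jobs.{args.job}")
    job_module.run(spark, config)


if __name__ == "__main__":
    main()
```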
I want to use AWS Glue to deploy my Spark jobs. Is there a way in AWS Glue to upload a whole Python project package, just like we zip one for AWS EMR step jobs?
My understanding is that AWS Glue only runs a single PySpark script (.py file). Is there another way to deploy this whole package and use application arguments to start the required job from AWS Step Functions? Keep in mind there may be shared Python packages in the lib/ folder.
Am I missing something related to AWS Glue?
Upvotes: 1
Views: 675
Reputation: 56
There are a couple of ways to do this. The easiest (since you mentioned the libraries needed are pretty common) is to add the libraries as a job parameter on the Glue job. For example, if you wanted to use the openai and langchain libraries, you could add them using Glue's built-in --additional-python-modules parameter.
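Programmatically, that parameter goes in the job's default arguments; a minimal sketch with boto3 (the job name, IAM role, script location, and module list are placeholders):

```python
# Sketch: creating a Glue Spark job that installs extra PyPI modules at start-up.
# Job name, role, and S3 paths below are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="data-pipeline-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",                                # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/main.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Comma-separated, pip-style list installed on the workers
        "--additional-python-modules": "openai,langchain",
    },
)
```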
You can also package up a custom library, save it in S3, and reference that path on the job as either an extra Python library or a JAR. Here's some documentation that I think might give more context:
Adding additional python libraries to glue jobs
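For your layout specifically, you could zip the shared lib/ (and jobs/) packages, upload the zip to S3, and pass it with Glue's --extra-py-files parameter; the single script you give Glue can then read a custom argument to decide which job to run. A rough sketch of that script side (the --job argument name and the zip contents are assumptions based on your structure):

```python
# Sketch of the Glue script: reads a custom --job argument and imports code
# from a zip passed via --extra-py-files (e.g. s3://my-bucket/dist/lib.zip).
import importlib
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# 'JOB_NAME' is set by Glue; 'job' is a custom argument (pass --job job_1,
# e.g. from the Step Functions task's Arguments)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "job"])

spark = SparkSession.builder.appName(args["JOB_NAME"]).getOrCreate()

# Packages inside the --extra-py-files zip are importable here
job_module = importlib.import_module(f"jobs.{args['job']}")
job_module.run(spark)
```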
Hope this helps!
Upvotes: 1