Reputation: 380
I am working on building data pipelines in PySpark. I expect to have 10-20 PySpark jobs, and these jobs can share common libraries and packages. The structure of the PySpark project could look like this:
data-pipeline
|__ data         # Extracted main data source directory in .json format
|__ dist         # PySpark package files in .zip
|__ jobs         # PySpark jobs
|   |__ job_1
|   |__ job_2
|   |__ job_3
|   ...
|__ lib          # Shared lib folder
|__ resources    # JARs required for the Spark extra classpath
|__ test         # PySpark job tests
|__ config.json  # Spark, jobs and DWH configuration for local runs
|__ main.py      # Main Spark driver program
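The main.py driver is meant to select a job at runtime from a command-line argument, roughly along these lines (simplified sketch; the run(spark, config) entry point per job is just my convention):

```python
# main.py -- simplified sketch of the driver (argument and entry-point names are illustrative)
import argparse
import importlib
import json

from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--job", required=True, help="Job module under jobs/, e.g. job_1")
    parser.add_argument("--config", default="config.json", help="Path to local configuration")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    spark = SparkSession.builder.appName(args.job).getOrCreate()

    # Each job package exposes a run(spark, config) entry point
    job_module = importlib.import_module(f"jobs.{args.job}")
    job_module.run(spark, config)


if __name__ == "__main__":
    main()
```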
I want to use AWS Glue to deploy my Spark jobs. Is there a way in AWS Glue to upload a whole Python project package, just like we zip one for AWS EMR step jobs?
My understanding is that AWS Glue only runs a single PySpark script (.py file). Is there another way to deploy this whole package and use application arguments to start the required job from AWS Step Functions? Keep in mind there may be shared Python packages in the lib/ folder.
Am I missing something related to AWS Glue?
Upvotes: 1
Views: 675
Reputation: 56
There are a couple of ways to do this. The easiest (since you mentioned the libraries needed are pretty common) is to add the libraries as a job parameter on the Glue job. For example, if you wanted to use the openai and langchain libraries, you could add them using Glue's built-in --additional-python-modules parameter.
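Programmatically, that parameter goes in the job's default arguments; a minimal sketch with boto3 (the job name, IAM role, script location, and module list are placeholders):

```python
# Sketch: creating a Glue Spark job that installs extra PyPI modules at start-up.
# Job name, role, and S3 paths below are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="data-pipeline-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",                                # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/main.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Comma-separated, pip-style list installed on the workers
        "--additional-python-modules": "openai,langchain",
    },
)
```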
You can also package up a custom library, save it in S3, and reference that path on the job as either an extra Python library or a JAR. Here's some documentation that I think might give more context:
Adding additional python libraries to glue jobs
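For your layout specifically, you could zip the shared lib/ (and jobs/) packages, upload the zip to S3, and pass it with Glue's --extra-py-files parameter; the single script you give Glue can then read a custom argument to decide which job to run. A rough sketch of that script side (the --job argument name and the zip contents are assumptions based on your structure):

```python
# Sketch of the Glue script: reads a custom --job argument and imports code
# from a zip passed via --extra-py-files (e.g. s3://my-bucket/dist/lib.zip).
import importlib
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# 'JOB_NAME' is set by Glue; 'job' is a custom argument (pass --job job_1,
# e.g. from the Step Functions task's Arguments)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "job"])

spark = SparkSession.builder.appName(args["JOB_NAME"]).getOrCreate()

# Packages inside the --extra-py-files zip are importable here
job_module = importlib.import_module(f"jobs.{args['job']}")
job_module.run(spark)
```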
Hope this helps!
Upvotes: 1