teejay

Reputation: 806

What's an organizing structure and development workflow for AWS Glue jobs?

I've been working with AWS Glue for the past 3-4 months to create PySpark scripts for ETL of large datasets. I'll typically create a notebook to do some exploratory work, then create a full-fledged version of the script, which I trigger manually via the console. I'm at the point where I have working bits and pieces which I now need to string together into a more robust and managed production pipeline.

I will be managing two unrelated datasets, each of which entails successive cleansing and transformation operations performed by different Glue jobs, with intermediate and final data stored in S3.

When I look at my Jobs page in AWS Glue Studio (AWS Glue -> ETL Jobs), everything is jumbled together: notebooks as well as the multiple jobs for each of my data pipelines.

There's plenty of great content available on how to create, run and optimize individual AWS Glue jobs, but I haven't been able to find anything comprehensive that describes best practices for organizing and managing everything. I anticipate that at some point I will add some sort of orchestration layer on top of the jobs, but that still leaves the question of how to organize and manage the underlying jobs themselves.

Questions

  1. Is there a way to organize jobs into three (and in future possibly more) "buckets": production-version jobs for the dataset A pipeline, production-version jobs for the dataset B pipeline, and exploratory notebooks?
  2. What are recommendations for a development workflow for AWS Glue? I'm used to a more traditional application development workflow with distinct environments (Dev, Prod, etc.), a code repository, and a CI/CD pipeline to promote code through environments. I'm wondering what the Glue equivalent is.

Either direct answers or pointers to recommended reading (that does more than just scratch the surface) would be much appreciated.

Upvotes: 3

Views: 551

Answers (1)

teejay

Reputation: 806

I've described below at a high level how I've set things up, but haven't gone into the nitty-gritty details. Happy to answer separate, more focused questions on SO about any of this.

Separation of products and environments

I've created separate Organizational Units (see this post on AWS Organizations and OUs) for my two products. Within each, I've created separate AWS accounts for Dev and Prod. I also created a third Shared Services Organizational Unit (more about this later). In addition to the technical benefits of separation of production code and data, this also allows me to easily split out my costs (I get a single invoice for the root organization, but it's broken out by Organizational Unit, which is useful).

Local development

Doing exploratory work directly within AWS was a pain due to the startup time for Notebooks or Glue jobs (and it gets expensive to just leave a Notebook running), and the AWS Glue Notebook web interface leaves a lot to be desired. I solved this by using the official AWS Glue Docker container for local exploration (here's a post from AWS about it). I use it in two ways:

  1. Run a Jupyter notebook within VSCode against the Spark engine inside the Docker container. This allows me to do early-stage exploratory Spark work, experiment with bits of code, etc. I keep the datasets small since it's all on my laptop. Having the Jupyter notebook in VSCode gives me the benefits of IntelliSense and GitHub Copilot code suggestions.
  2. Run my unit tests against the Spark engine in the container (there's a sketch of this just after the list).
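Here's a minimal sketch of what one of those tests looks like when run with pytest inside the Glue container. In practice the transform under test is imported from my package; it's defined inline here (with made-up column names) so the sketch is self-contained.

```python
import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def add_full_name(df: DataFrame) -> DataFrame:
    # Stand-in for a real transform that would normally live in the package under test.
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))


@pytest.fixture(scope="session")
def spark():
    # Inside the Glue Docker container this picks up the bundled Spark/Glue libraries.
    return SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()


def test_add_full_name(spark):
    df = spark.createDataFrame(
        [("Ada", "Lovelace"), ("Grace", "Hopper")],
        ["first_name", "last_name"],
    )
    result = add_full_name(df)
    assert result.columns == ["first_name", "last_name", "full_name"]
    assert result.filter(F.col("full_name") == "Ada Lovelace").count() == 1
```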

Filesystem agnosticism

I use the excellent universal-pathlib package, which means my code doesn't have to care whether data is on the local filesystem or in S3. I use configuration parameters to set a base path: for local development the base path is on my laptop; when running in AWS, the config sets the base path to an S3 location.
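To illustrate the pattern, here's a rough sketch using UPath from universal-pathlib; the bucket name, directory layout and function are placeholders, and reading from S3 this way assumes s3fs is installed alongside universal-pathlib.

```python
from upath import UPath


def load_orders(spark, base_path: str):
    # base_path comes from config: a local directory during development,
    # an S3 URI (e.g. "s3://my-bucket/my-prefix") when running in Glue.
    raw_dir = UPath(base_path) / "raw" / "orders"
    input_files = [str(p) for p in raw_dir.glob("*.parquet")]
    return spark.read.parquet(*input_files)


# Locally:      load_orders(spark, "/home/me/data")
# In AWS Glue:  load_orders(spark, "s3://my-bucket/my-prefix")
```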

AppConfig for configuration

I define my config parameters as YAML within AWS AppConfig. The structure and names are identical for dev and prod, but one instance lives in the development AWS account and the other within the production AWS account. This way I can have different values for dev and prod that point to the correct S3 locations, database connection strings, etc, and the code is ignorant about it.
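For context, here's roughly how a job pulls that YAML at startup with boto3's appconfigdata client; the application, environment and profile names (and the base_path key) are placeholder examples.

```python
import boto3
import yaml


def load_config(app: str, env: str, profile: str) -> dict:
    client = boto3.client("appconfigdata")  # credentials decide dev vs prod account
    session = client.start_configuration_session(
        ApplicationIdentifier=app,
        EnvironmentIdentifier=env,
        ConfigurationProfileIdentifier=profile,
    )
    response = client.get_latest_configuration(
        ConfigurationToken=session["InitialConfigurationToken"]
    )
    return yaml.safe_load(response["Configuration"].read())


config = load_config("my-data-product", "prod", "pipeline-settings")
base_path = config["base_path"]  # e.g. an S3 URI in AWS, a local dir in dev
```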

Promotion of code

I don't have any logic in my Glue scripts. Instead, my core logic is within Python packages, and the scripts themselves are just wrappers to call entry point functions within my packages. This allows me to more easily modularize my logic and share common functionality between different jobs. I publish my packages to AWS CodeArtifact repositories that live within the Shared Services OU/account that I mentioned previously (I used guidance from here and here to get that set up).
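As an example of how thin those scripts are, here's roughly what one looks like; the package and module names (my_product.jobs.cleanse) and the stage argument are made up for illustration.

```python
import sys

from awsglue.utils import getResolvedOptions

# Installed from CodeArtifact (e.g. via the --additional-python-modules job argument).
from my_product.jobs import cleanse

args = getResolvedOptions(sys.argv, ["JOB_NAME", "stage"])
cleanse.run(stage=args["stage"])
```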

Workflow

So now my workflow is:

  1. Start up the Glue Docker container locally, connect to it with VSCode and do my experimental work on my laptop within a Jupyter notebook.
  2. Write my unit tests and code; run the tests within the Docker container.
  3. Run the code locally within the container against a small version of my data.
  4. When all looks good, publish the package to CodeArtifact with the version number incremented.
  5. Update the AWS Glue job within the development account to use the newer version (sketched after this list) and run it against a larger dataset, which allows me to make any tweaks for performance.
  6. Finally, update the AWS Glue job within the production account to use the newer version.
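Steps 5 and 6 boil down to repointing the job's --additional-python-modules argument at the new package version, which could be scripted roughly like this with boto3 (job and package names are placeholders, and the set of read-only keys stripped before update_job may need adjusting):

```python
import boto3


def pin_package_version(job_name: str, package_spec: str) -> None:
    glue = boto3.client("glue")  # credentials select the dev or prod account
    job = glue.get_job(JobName=job_name)["Job"]

    # get_job returns fields that update_job does not accept.
    for key in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity", "MaxCapacity"):
        job.pop(key, None)

    args = job.get("DefaultArguments", {})
    args["--additional-python-modules"] = package_spec
    job["DefaultArguments"] = args

    glue.update_job(JobName=job_name, JobUpdate=job)


# e.g. after publishing 1.4.0 to CodeArtifact:
# pin_package_version("dataset-a-cleanse", "my_product==1.4.0")
```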

Further improvements I need to make

Code promotion through to production is currently a manual process, so there's scope for automation there. I know nothing about Infrastructure as Code (other than that it's a thing, and it sounds great!), and my account setup is a manual process. It would be nice to have everything defined using IaC, which would allow me to easily spin up/down a separate AWS account as a sandbox for anything experimental I want to do.

Upvotes: 3
