Reputation: 806
I've been working with AWS Glue for the past 3-4 months to create PySpark scripts for ETL of large datasets. I'll typically create a notebook to do some exploratory work, then create a full-fledged version of the script, which I trigger manually via the console. I'm at the point where I have working bits and pieces which I now need to string together into a more robust and managed production pipeline.
I will be managing two unrelated datasets, each of which entails successive cleansing and transformation operations performed by different Glue jobs, with intermediate and final data stored in S3.
When I look at the Jobs page in AWS Glue Studio (AWS Glue -> ETL Jobs), everything is jumbled together: notebooks as well as the multiple jobs for each of my data pipelines.
There's plenty of great content available on how to create, run and optimize individual AWS Glue jobs, but I haven't been able to find anything comprehensive that describes best practices for organizing and managing everything. I anticipate at some point I will add some sort of orchestration layer on top of the jobs, but that still leaves the question of how to organize and manage the underlying jobs themselves.
Questions
Either direct answers or pointers to recommended reading (that does more than just scratch the surface) would be much appreciated.
Upvotes: 3
Views: 551
Reputation: 806
I've described below at a high level how I've set things up, but I haven't gone into the nitty-gritty details. Happy to answer separate, more focused questions on SO about any of this.
Separation of products and environments
I've created separate Organizational Units (see this post on AWS Organizations and OUs) for my two products.
Within each, I've created separate AWS accounts for Dev and Prod.
I also created a third Shared Services Organizational Unit (more about this later). In addition to the technical benefits of separating production code and data, this also allows me to easily split out my costs (I get a single invoice for the root organization, but it's broken out by Organizational Unit, which is useful).
Local development
Doing exploratory work directly within AWS was a pain due to the startup time for Notebooks or Glue jobs (and it gets expensive to just leave a Notebook running), and the AWS Glue Notebook web interface leaves a lot to be desired. I solved this by using the official AWS Glue Docker container for local exploration (here's a post from AWS about it). I use it in two ways.
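However the container is used, the same session boilerplate runs both locally and in a deployed Glue job, so exploratory code carries over with little change. A minimal sketch (the S3 path is just a placeholder, not part of my setup):

```python
# Standard Glue session setup; works inside the local Glue Docker container
# and in a deployed Glue job.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Exploratory read; the path is a placeholder
df = spark.read.parquet("s3://example-bucket/raw/sample/")
df.printSchema()
```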
Filesystem agnosticism
I use the excellent universal-pathlib package, which allows my code not to care whether data is on the local filesystem or in S3. I use configuration parameters to set a base path: for local development the base path is on my laptop; when running in AWS, the config sets the base path to an S3 location.
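A minimal sketch of the idea, assuming universal-pathlib is installed (the paths and file names below are placeholders):

```python
from upath import UPath

def read_customers(base_path: str) -> str:
    # base_path comes from configuration; it can be a local directory
    # ("/home/me/data") or an S3 prefix ("s3://example-bucket/data")
    root = UPath(base_path)
    return (root / "raw" / "customers.csv").read_text()

# Local development:
#   read_customers("/home/me/data")
# Running in AWS, with the base path supplied by config:
#   read_customers("s3://example-prod-bucket/data")
```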
AppConfig for configuration
I define my config parameters as YAML within AWS AppConfig. The structure and names are identical for dev and prod, but one instance lives in the development AWS account and the other in the production AWS account. This way I can have different values for dev and prod that point to the correct S3 locations, database connection strings, etc., and the code doesn't need to know which environment it's running in.
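A minimal sketch of pulling that YAML at job start via the appconfigdata API (the application, environment and profile names are placeholders, not my actual setup):

```python
import boto3
import yaml

def load_config(app: str, env: str, profile: str) -> dict:
    # Start a configuration session and fetch the latest deployed config
    client = boto3.client("appconfigdata")
    session = client.start_configuration_session(
        ApplicationIdentifier=app,
        EnvironmentIdentifier=env,
        ConfigurationProfileIdentifier=profile,
    )
    response = client.get_latest_configuration(
        ConfigurationToken=session["InitialConfigurationToken"]
    )
    return yaml.safe_load(response["Configuration"].read())

# config = load_config("example-etl", "dev", "pipeline-settings")
# base_path = config["base_path"]
```

Because the dev and prod instances live in different accounts, the same call works in both environments; the credentials the job runs under determine which one it reaches.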
Promotion of code
I don't have any logic in my Glue scripts. Instead, my core logic is within Python packages, and the scripts themselves are just wrappers to call entry point functions within my packages. This allows me to more easily modularize my logic and share common functionality between different jobs. I publish my packages to AWS CodeArtifact repositories that live within the Shared Services OU/account that I mentioned previously (I used guidance from here and here to get that set up).
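To make the "thin wrapper" idea concrete, here is a sketch of what one of these Glue scripts might look like (the package and entry point names are hypothetical):

```python
import sys

from awsglue.utils import getResolvedOptions

# Core logic lives in a package published to CodeArtifact; this name is made up
from my_pipeline.jobs import clean_customers

# Job parameters are supplied as --config_profile in the Glue job definition
args = getResolvedOptions(sys.argv, ["JOB_NAME", "config_profile"])
clean_customers.run(config_profile=args["config_profile"])
```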
Workflow
So now my workflow is:
Further improvements I need to make
Code promotion through to production is currently a manual process, so there's scope for automation. I also know nothing about Infrastructure as Code (other than that it's a thing, and it sounds great!), and my account setup is a manual process. It would be nice to have everything defined using IaC, which would allow me to easily spin up or tear down a separate AWS account as a sandbox for anything experimental that I want to do.
Upvotes: 3