Reputation: 486
I have a data normalization process that exists in Python but now needs to scale. This process currently runs from a job-specific configuration file that lists the transforming functions to apply to that job's table of data. The transforming functions are mutually exclusive and can be applied in any order. All transforming functions live in a library and are only imported and applied to the data when they are listed in the job-specific configuration file. Different jobs will list different required functions in their configuration, but all functions exist in the library.
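For concreteness, the current flow looks roughly like this (the library and function names here are illustrative, not the real ones):

```python
import importlib
import json

import pandas as pd


def run_job(config_path: str, data: pd.DataFrame) -> pd.DataFrame:
    # Job config lists the transforms to apply, e.g.
    # {"transforms": ["strip_whitespace", "normalize_dates"]}
    with open(config_path) as f:
        config = json.load(f)

    # All transforms live in one shared library module.
    library = importlib.import_module("transform_lib")

    # Each listed function takes a DataFrame and returns a DataFrame.
    for name in config["transforms"]:
        data = getattr(library, name)(data)
    return data
```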
In the most general sense, how might a process like this be handled by AWS Glue? I don't need a technical example as much as a high level overview. Simply looking to be aware of some options. Thanks!
Upvotes: 0
Views: 725
Reputation: 3988
The single most important thing to consider with AWS Glue is that it is a serverless, Spark-based environment with extensions. That means you will need to adapt your script to be PySpark-like. If you are OK with that, you can use external Python libraries by following the instructions at AWS Glue Documentation
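To make that concrete, here is a very rough sketch of how your config-driven transform list could look inside a Glue job script. Everything beyond the awsglue/pyspark imports (the library name, the catalog database and table, the `--transforms` job parameter) is a placeholder, and each transform would need to be rewritten to take and return a Spark DataFrame rather than a pandas one:

```python
import importlib
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job parameters carry the per-job list your config file holds today,
# e.g. --transforms "strip_whitespace,normalize_dates"
args = getResolvedOptions(sys.argv, ["JOB_NAME", "transforms"])

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the source table from the Glue Data Catalog and work on it as a
# Spark DataFrame (placeholder database/table names).
df = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
).toDF()

# The shared library would be shipped to the job via --extra-py-files.
library = importlib.import_module("my_transform_lib")
for name in args["transforms"].split(","):
    df = getattr(library, name)(df)  # each transform: DataFrame -> DataFrame

# (Writing the result back out to S3 or the catalog is omitted here.)
```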
If you already have your scripts running and don't feel like using Spark, you can always consider AWS Data Pipeline. It's a service that runs data transforms in more ways than just Spark. On the downside, AWS Data Pipeline is task-driven rather than data-driven, which means no catalog or schema management.
How to use AWS Data Pipeline with Python is not obvious from the documentation, but the process is basically to stage a shell file in S3 with the instructions to set up your Python environment and invoke the script. You then configure scheduling for the pipeline, and AWS takes care of starting the virtual machines whenever they are needed and stopping them afterwards. There is a good post on Stack Overflow about this
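As a rough illustration of the staging step (the bucket, keys, and package name are made up), you could push a small bootstrap script to S3 with boto3; a ShellCommandActivity in the pipeline would then run it on the EC2 resource:

```python
import boto3

# Bootstrap script the pipeline's EC2 resource will execute: set up the
# Python environment, pull the job config and script, run the job.
BOOTSTRAP = """#!/bin/bash
set -e
pip install --user my_transform_lib                 # assumed package name
aws s3 cp s3://my-etl-bucket/jobs/job_a.json .      # job-specific config
aws s3 cp s3://my-etl-bucket/scripts/normalize.py .
python normalize.py --config job_a.json
"""

# Stage the shell file in S3 where the ShellCommandActivity can find it.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-etl-bucket",
    Key="bootstrap/job_a.sh",
    Body=BOOTSTRAP.encode("utf-8"),
)
```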
Upvotes: 1