Reputation: 9845
I have an ETL job in Glue that processes a very large (300M-row) JDBC database table, but I really only need a subset (certain ids) of this table. When I run
glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")
does this command load the entire table? Is there a way to write a custom query so that only the data I need is loaded? Or, if I follow this with another command, say Filter,
or a Spark SQL query on the DataFrame, will the filter be applied as the data is pulled?
Upvotes: 1
Views: 1389
Reputation: 4059
Well, when you run:
glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")
it only creates a reference (a DynamicFrame, which wraps a Spark DataFrame); no data is read at this point.
Spark works with transformations (e.g. filter, map, select) and actions (e.g. collect, count, show). You can read more about it here: How Apache Spark’s Transformations And Action works. Basically, your database table is only loaded into memory when an action
is called. This lazy evaluation is one of many reasons Spark is so powerful and recommended for datasets of any size.
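The transformation/action split works much like Python generators: the pipeline is only described until something consumes it. A minimal stand-alone analogy (pure Python, not the Glue API; the sizes and the `% 100_000` filter are made up for illustration):

```python
# Lazy "transformations": nothing is computed when these two lines run,
# they only describe the pipeline (like filter()/select() on a DataFrame)
rows = range(1_000_000)                            # stand-in for the big table
wanted = (r for r in rows if r % 100_000 == 0)     # like a filter transformation
projected = (r * 2 for r in wanted)                # like a select/map transformation

# The "action": only now is any work actually done
result = list(projected)                           # analogous to collect()/show()
print(result)  # [0, 200000, 400000, ..., 1800000] — 10 rows survive the filter
```

Note the analogy covers laziness only; whether the filter also reaches the database (rather than being applied after a full scan) depends on predicate pushdown, discussed below.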
This PDF shows all the transformations and actions available, with samples of how to use them.
So yes, you need to do a few steps first, for example:
# from_catalog returns a DynamicFrame; convert it to a Spark DataFrame
# to use the filter/select API
dyf = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")
df = dyf.toDF().filter(YOUR_FILTER).select(SPECIFIC_COLS)
# Calling an action to show the filtered DF
df.show()
This ensures that only the columns and rows you need end up in memory (Spark will also try to push the filter and column pruning down to the JDBC source where possible).
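If you want to be certain the subset is selected at the source rather than after a full scan, the usual pattern is to hand the database a query that already contains the filter (for Spark JDBC reads this is what `spark.read.format("jdbc").option("query", ...)` does). The idea can be sketched with `sqlite3` standing in for the JDBC database; the table contents and ids below are made up:

```python
import sqlite3

# Build a toy "persons" table standing in for the 300M-row JDBC source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO persons VALUES (?, ?)",
                 [(i, f"person_{i}") for i in range(1000)])

# Push the filter down to the database: only the needed ids cross the wire,
# instead of pulling all rows and filtering client-side
wanted_ids = (3, 42, 999)
rows = conn.execute(
    "SELECT id, name FROM persons WHERE id IN (?, ?, ?) ORDER BY id",
    wanted_ids,
).fetchall()
print(rows)  # [(3, 'person_3'), (42, 'person_42'), (999, 'person_999')]
conn.close()
```

The same shape applies to the Glue job: the smaller the result set the database returns, the less Spark has to shuffle and hold in memory.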
Upvotes: 5