Jeff

Reputation: 1426

How to load partial data from a JDBC cataloged connection in AWS Glue?

I understand that I can load an entire table from a JDBC Cataloged connection via the Glue context like so:

glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_rds_postgresql",
    table_name="public_foo_table",
    transformation_ctx="datasource0"
)

However, what I'd like to do is partially load a table using the cataloged connection, as if I were using an uncataloged JDBC connection via the Spark session, like this:

query = "(select * from public.foo_table where date='%s') as data" % date_fm)
spark.read.jdbc(url=DB_URL, table=query)

Is there a way I could use the Cataloged connection instead?

Alternatively, when using an uncataloged connection, I'm having a hard time understanding how to lock down access so that only a Glue job can use it. What are my options?

Upvotes: 1

Views: 2556

Answers (2)

Navnit Shukla

Reputation: 49

To perform predicate pushdown against a database, you need to use the sampleQuery parameter, as described in the AWS documentation:

"sampleQuery": (Optional) The custom SQL query statement for sampling. By default the sample query is executed by single executor. If you're reading a large dataset, you may need to enable JDBC partitioning to query a table in parallel. For more information, see Reading from JDBC Tables in Parallel. To use sampleQuery with JDBC partitioning, also set enablePartitioningForSampleQuery to true.

Pro tip: you need to set enablePartitioningForSampleQuery to true, and your SQL statement must end with "where" or "and", as described in the doc:

"enablePartitioningForSampleQuery": (Optional) By default this option is false. Required if you want to use sampleQuery with a partitioned JDBC table. If set to true, sampleQuery must end with "where" or "and" for AWS Glue to append partitioning conditions. See the example below.

Here is my code snippet, where I am bringing in data for sporting_event_id = 958 from an MSSQL database.

I have also added parallelism to the JDBC read by passing the hashpartitions parameter:

S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mssql_dms_sample",
    table_name="mssql_dms_sample_dbo_sporting_event_ticket",
    transformation_ctx="S3bucket_node1",
    additional_options={
        "hashpartitions": "20",  # number of parallel JDBC partitions
        "hashfield": "id",
        "hashexpression": "sporting_event_id",
        "enablePartitioningForSampleQuery": True,
        "sampleQuery": "select * from sporting_event_ticket where sporting_event_id = 958 and",
    },
)
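
Because the sampleQuery ends with "and", Glue can append its own partitioning conditions (derived here from hashexpression and hashpartitions) to the statement, so each of the 20 partitions should read a disjoint slice of the sporting_event_id = 958 rows instead of the whole result going through a single executor.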

Upvotes: 3

Tanveer Uddin

Reputation: 1525

In theory, you can achieve this with a cataloged connection in Glue using pushdown predicates, where you add a push_down_predicate parameter to filter your data when reading it. So, using your example:

glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_rds_postgresql",
    table_name="public_foo_table",
    transformation_ctx="datasource0",
    push_down_predicate="date='%s'" % date_fm
)

However, it seems that at the time of writing this is only supported for S3 sources.
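
As a possible workaround, you could keep the credentials in the catalog and run the custom query through plain Spark, by pulling the JDBC settings out of the cataloged connection with glueContext.extract_jdbc_conf. A rough sketch, where the connection name and driver are placeholders and the exact keys returned can vary by Glue version:

jdbc_conf = glueContext.extract_jdbc_conf("my_rds_postgresql_connection")  # placeholder name

query = "(select * from public.foo_table where date='%s') as data" % date_fm
df = spark.read.jdbc(
    url=jdbc_conf["url"],
    table=query,
    properties={
        "user": jdbc_conf["user"],
        "password": jdbc_conf["password"],
        "driver": "org.postgresql.Driver",  # assumes a PostgreSQL source
    },
)

This keeps the database credentials in the Data Catalog rather than in job parameters, while still letting you push an arbitrary query down to the database.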

Upvotes: 0
