Reputation: 1426
I understand that I can load an entire table from a JDBC Cataloged connection via the Glue context like so:
glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_rds_postgresql",
    table_name="public_foo_table",
    transformation_ctx="datasource0"
)
However, what I'd like to do is partially load a table using the cataloged connection, the way I would with an uncataloged JDBC connection via the Spark session, like this:
query = "(select * from public.foo_table where date='%s') as data" % date_fm)
spark.read.jdbc(url=DB_URL, table=query)
Is there a way I could use the Cataloged connection instead?
Alternatively, when using an uncataloged connection I'm having a hard time understanding how to lock down access to the uncataloged connection such that only a Glue job can access it. What are my options?
Upvotes: 1
Views: 2556
Reputation: 49
In order to perform predicate pushdown to databases, you need to use the sampleQuery parameter, as described in the AWS documentation:
"sampleQuery": (Optional) The custom SQL query statement for sampling. By default the sample query is executed by single executor. If you're reading a large dataset, you may need to enable JDBC partitioning to query a table in parallel. For more information, see Reading from JDBC Tables in Parallel. To use sampleQuery with JDBC partitioning, also set enablePartitioningForSampleQuery to true.
Pro tip: you need to set enablePartitioningForSampleQuery to true. Also, end your SQL statement with "where" or "and", as described in the doc:
"enablePartitioningForSampleQuery": (Optional) By default this option is false. Required if you want to use sampleQuery with a partitioned JDBC table. If set to true, sampleQuery must end with "where" or "and" for AWS Glue to append partitioning conditions. See the example below.
Here is my code snippet, where I bring in the data for sporting_event_id = 958 from an MSSQL database. I have also added parallelism to the JDBC read by passing the hashpartitions parameter.
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mssql_dms_sample",
    table_name="mssql_dms_sample_dbo_sporting_event_ticket",
    transformation_ctx="S3bucket_node1",
    additional_options={
        "hashpartitions": "20",
        "hashfield": "id",
        "hashexpression": "sporting_event_id",
        "enablePartitioningForSampleQuery": True,
        "sampleQuery": "select * from sporting_event_ticket where sporting_event_id = 958 and",
    },
)
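As a quick sanity check (a minimal sketch, not part of the original snippet), you can convert the resulting DynamicFrame to a Spark DataFrame and confirm only the filtered event came back:

# Hypothetical follow-up: verify the sampleQuery filter was applied
df = S3bucket_node1.toDF()
print(df.count())
df.select("sporting_event_id").distinct().show()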
Upvotes: 3
Reputation: 1525
In theory, you can achieve this with a cataloged connection in Glue using pushdown predicates: you add a push_down_predicate parameter to filter the data as it is read. So, using your example:
glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_rds_postgresql",
    table_name="public_foo_table",
    transformation_ctx="datasource0",
    push_down_predicate="date = '%s'" % date_fm
)
However, it seems that, at the time of writing, push_down_predicate is only supported for S3 sources.
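If you need the filter to run on the database side today, one workaround (a sketch, not something the answers above cover) is to pull the credentials out of the cataloged connection with glueContext.extract_jdbc_conf and issue the filtered query through spark.read.jdbc yourself. Assuming a connection named jdbc_rds_postgresql_connection (hypothetical; the exact keys returned, and whether "url" includes the database name, vary by Glue version):

# Sketch: reuse the cataloged connection's credentials for a filtered JDBC read
conf = glueContext.extract_jdbc_conf("jdbc_rds_postgresql_connection")
query = "(select * from public.foo_table where date = '%s') as data" % date_fm
df = spark.read.jdbc(
    url=conf["url"],  # on some Glue versions you may need to append the database name
    table=query,
    properties={"user": conf["user"], "password": conf["password"]},
)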
Upvotes: 0