asdfghjkl9999

Reputation: 53

Is there a way to use Spark SQL to query partition information in the AWS Glue Data Catalog (similar to what Athena offers)?

I'm currently developing a Glue ETL script in PySpark that needs to query my Glue Data Catalog's partitions and join that information with other Glue tables programmatically.

At the moment, I'm able to do this in Athena using SELECT * FROM db_name.table_name$partitions JOIN table_name2 ON ..., but it looks like that syntax doesn't work in Spark SQL. The closest thing I've been able to find is SHOW PARTITIONS db_name.table_name, which doesn't seem to cut it on its own.
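In principle, I suppose I could parse the SHOW PARTITIONS output into a DataFrame myself and join against that; here's a rough sketch of what I mean (the partition keys year and month are made up for illustration, and table_name2 is assumed to share a year column):

```python
# Sketch: reshape SHOW PARTITIONS output into a joinable DataFrame.
# Partition keys `year` and `month` are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Each row comes back as a single string like "year=2021/month=03"
# in a column named `partition`
parts = spark.sql("SHOW PARTITIONS db_name.table_name")

# Split the partition spec into one column per partition key
parts = (
    parts
    .withColumn("kv", F.split(F.col("partition"), "/"))
    .withColumn("year", F.split(F.col("kv")[0], "=")[1])
    .withColumn("month", F.split(F.col("kv")[1], "=")[1])
    .drop("kv", "partition")
)

# Now it can be joined like any other DataFrame
result = parts.join(spark.table("db_name.table_name2"), on="year")
```

But that feels brittle compared to just writing SQL against the partitions directly, like I can in Athena.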

Does anyone know an easy way I can leverage Glue ETL / Boto3 (Glue API) / PySpark to query my partition information in a SQL-like manner?

For the time being, the only other workaround seems to be the get_partitions() method in Boto3 (sketched below), but that looks like considerably more work to deal with on my end. I already have the Athena queries that get me the information I need, so ideally there'd be a way to replicate getting my tables' partitions in a similar way using SQL. Please let me know, thank you!
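For completeness, this is roughly what the get_partitions() route would look like; a sketch only, with hypothetical partition keys and the db_name/table_name placeholders from above:

```python
# Sketch: pull partitions via the Glue API and expose them to Spark SQL.
# Partition key names (`year`, `month`) are hypothetical.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
glue = boto3.client("glue")

rows = []
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="db_name", TableName="table_name"):
    for p in page["Partitions"]:
        rows.append(tuple(p["Values"]))  # values are in partition-key order

# Build a DataFrame keyed by the partition columns and register it for SQL
partitions_df = spark.createDataFrame(rows, ["year", "month"])
partitions_df.createOrReplaceTempView("table_name_partitions")

spark.sql("""
    SELECT *
    FROM table_name_partitions p
    JOIN db_name.table_name2 t ON p.year = t.year
""").show()
```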

Upvotes: 0

Views: 1297

Answers (1)

asdfghjkl9999

Reputation: 53

For those interested, an alternative workaround I've found (but still need to test) is calling the Athena API through the Boto3 client. I may also use AWS Wrangler's Athena integration to retrieve the results as a dataframe, roughly like the sketch below.
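A minimal sketch of the AWS Wrangler route, assuming the db_name/table_name placeholders from the question and that the awswrangler package is available in the job environment:

```python
# Sketch: run the existing Athena $partitions query through AWS Wrangler
# and hand the result to Spark for further joins.
import awswrangler as wr
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Athena requires the $partitions table name to be double-quoted
df = wr.athena.read_sql_query(
    sql='SELECT * FROM "table_name$partitions"',
    database="db_name",
)

# Convert the pandas result to a Spark DataFrame for SQL-style joins
spark_df = spark.createDataFrame(df)
spark_df.createOrReplaceTempView("table_name_partitions")
```

The upside is that the existing Athena queries can be reused as-is; the trade-off is that each call runs an actual Athena query rather than hitting the Glue API directly.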

Upvotes: 0
