Loheek

Reputation: 1985

Join tables from different Glue catalogs with PySpark on EMR

To query a Glue Catalog from PySpark on EMR, I set the parameter hive.metastore.glue.catalogid in my cluster configuration.
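For reference, a minimal sketch of the single-catalog setup described above. It assumes the catalog id can also be passed as a per-session Hadoop property (spark.hadoop.hive.metastore.glue.catalogid) instead of through the EMR cluster configuration JSON; the account id, database and table names are placeholders:

from pyspark.sql import SparkSession

# Point the Glue metastore client at one specific catalog (account).
# Assumed equivalent to setting hive.metastore.glue.catalogid in the
# EMR cluster configuration, but applied per session here.
spark = (
    SparkSession.builder
    .appName("single-catalog-query")
    .config("spark.hadoop.hive.metastore.glue.catalogid", "111122223333")  # placeholder account id
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SELECT * FROM demodb.tab1").show()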

Is it possible to join tables from different Glue catalogs (on different AWS accounts)?

I tried creating a view with Athena from one AWS account to the other, but apparently PySpark is not able to query those SQL views.

Upvotes: 4

Views: 2401

Answers (1)

Golammott

Reputation: 536

This is possible in PySpark by setting the Glue catalog separator config.

pyspark --conf spark.hadoop.aws.glue.catalog.separator="/"

The desired catalogs can then be referenced directly in your PySpark SQL query. Note that the catalog id (the AWS account id) is separated from the database name by the configured separator /:

spark.sql("select * from `111122223333/demodb.tab1` t1 inner join `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2").show()

Source: AWS documentation
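Putting both pieces together, here is a minimal self-contained sketch. It assumes the separator can also be set on the SparkSession builder rather than on the pyspark command line; the account ids, database, table and column names are the placeholders from above:

from pyspark.sql import SparkSession

# Set the Glue catalog separator so tables can be referenced in SQL
# as `<account-id>/<database>.<table>`.
spark = (
    SparkSession.builder
    .appName("cross-catalog-join")
    .config("spark.hadoop.aws.glue.catalog.separator", "/")
    .enableHiveSupport()
    .getOrCreate()
)

# Join a table from catalog 111122223333 with one from 444455556666.
spark.sql("""
    SELECT *
    FROM `111122223333/demodb.tab1` t1
    INNER JOIN `444455556666/demodb.tab2` t2
      ON t1.col1 = t2.col2
""").show()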

Upvotes: 4
