mransley

Reputation: 661

Creating a Glue Data Catalog Table within a Glue Job

I have what I consider to be a pretty simple requirement.

I want to create a job that takes one file, transforms it into another file, and then updates the Data Catalog metadata within Glue. This would allow another job to pick up the new data source and consume it using Glue/EMR/Athena.

Now, I can do the transform without any issues, but for the life of me I cannot work out how to create the table within Glue other than by using a crawler, the console, or the Glue API. I would prefer to do this inside the job so that I can simply trigger the next job rather than execute a crawler and wait for it to complete.

The issue with the Glue API is that I would also have to convert the Spark schema into the layout the API expects.
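For what it's worth, the schema conversion is fairly mechanical, because Spark's `simpleString()` type names (`string`, `int`, `bigint`, `double`, `timestamp`, ...) are already valid Hive/Glue type names in the common cases. A minimal sketch of the boto3 route, assuming the job's IAM role has Glue permissions; the function names, database, table, and S3 path are placeholders:

```python
def schema_to_glue_columns(fields):
    """Convert (name, spark_type_string) pairs into the Columns list
    the Glue create_table API expects. For simple types the Spark
    simpleString() names are already Hive/Glue-compatible."""
    return [{"Name": name, "Type": type_string} for name, type_string in fields]

def register_parquet_table(database, table, dataframe, s3_path):
    """Hypothetical helper: register an external Parquet table for
    `dataframe` in the Glue Data Catalog via the boto3 Glue API."""
    import boto3  # imported here; boto3 is available on Glue job workers

    columns = schema_to_glue_columns(
        [(f.name, f.dataType.simpleString()) for f in dataframe.schema.fields]
    )
    boto3.client("glue").create_table(
        DatabaseName=database,
        TableInput={
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet"},
            "StorageDescriptor": {
                "Columns": columns,
                "Location": s3_path,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )
```

This avoids the crawler entirely, at the cost of calling the Glue API from inside the job.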

In Spark on EMR I can create the Glue Data Catalog table pretty easily (although it is not well documented!):

dataframe.write.mode(mode).format("parquet").option("path", parquet_path).saveAsTable(glue_table)
dataframe.write.format("parquet").mode(mode).save(parquet_path)

This doesn't work in Glue. I can set up the Glue Data Catalog as the Hive metastore on the Spark session within the Glue job:

spark = SparkSession.builder \
    .appName(args['JOB_NAME']) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()

but when I try to set the database it says it doesn't exist, and when I list the databases I get the following:

Databases=[Database(name=u'default', description=u'Default Hive database', locationUri=u'hdfs://ip-172-31-29-88.ap-southeast-2.compute.internal:8020/user/spark/warehouse')]

This makes me think that the Glue job isn't using the Glue Data Catalog at all; it appears to be using a default Hive catalog. Am I missing something?

The reason this is an issue is that in EMR I can do things like:

spark.sql("select * from my_glue_table")

This works in EMR, but I suspect it will not work in a Glue job unless I run a crawler first, and I really don't see the need for a crawler when in EMR I can do it with one line of code.

Am I missing something here?

Thanks in advance.

Upvotes: 5

Views: 14401

Answers (3)

Thom Lane

Reputation: 1063

You can use the CREATE TABLE statement in Spark SQL to add the table to the AWS Glue Catalog.

spark.sql("USE database_name")

df.createOrReplaceTempView("df")

spark.sql("""
    CREATE TABLE table_name
    USING CSV
    AS SELECT * FROM df
""")

When writing to CSV, I had to make sure the location URI for the Glue database was set; otherwise I'd get 'Can not create a Path from an empty string' errors, even when setting LOCATION in the query.

When writing to Parquet, it worked by setting LOCATION to an Amazon S3 path.
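The Parquet variant of the pattern above can be sketched as follows; the table, view, and bucket names are placeholders, and the helper just assembles the statement so the LOCATION clause stays explicit:

```python
def create_table_sql(table, source_view, s3_path):
    """Build a Spark SQL CREATE TABLE AS SELECT statement that writes
    Parquet to an explicit S3 LOCATION, so a Glue database with no
    location URI does not cause 'empty string' path errors."""
    return (
        "CREATE TABLE {table} "
        "USING PARQUET "
        "LOCATION '{path}' "
        "AS SELECT * FROM {view}"
    ).format(table=table, path=s3_path, view=source_view)

# Inside the Glue job (placeholder names):
# df.createOrReplaceTempView("df")
# spark.sql("USE my_database")
# spark.sql(create_table_sql("my_table", "df", "s3://my-bucket/my_table/"))
```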

Upvotes: 0

Yuriy Bondaruk

Reputation: 4750

You can create a temp view from a DataFrame and run SQL queries against it:

val dataDf = glueContext.sparkSession.read.format(format).load(path)
// or: val dataDf = dynamicFrame.toDF()
dataDf.createOrReplaceTempView("my_glue_table")
val allDataDf = glueContext.sparkSession.sql("select * from my_glue_table")

To create a table in the Data Catalog, the following code can help:

val table = new com.amazonaws.services.glue.catalog.Table(namespace, tblName, schema,
  partitions, parameters, location, serdeInfo, hiveCompatible)
glueContext.getCatalogClient.createTable(table)

Upvotes: 2
