mransley

Reputation: 661

Creating a Glue Data Catalog Table within a Glue Job

I have what I consider to be a pretty simple requirement.

I want to create a job that takes one file, transforms it into another file, and then updates the Data Catalog metadata within Glue. This would allow another job to pick up the new data source and consume it using Glue/EMR/Athena.

Now, I can do the transform without any issues, but for the life of me I cannot work out how to create the table within Glue other than by using a crawler, the console, or the Glue API. I would prefer to do this inside the job so that I can simply trigger the next job rather than execute a crawler and wait for it to complete.

The issue with the Glue API is that I would also have to convert the Spark schema into the layout the API expects.
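For what it's worth, the schema conversion is fairly mechanical, because Spark's `simpleString()` type names (`string`, `int`, `bigint`, `double`, `timestamp`, ...) are already valid Hive/Glue type names in the common cases. A minimal sketch of the boto3 route, assuming the job's IAM role has Glue permissions; the function names, database, table, and S3 path are placeholders:

```python
def schema_to_glue_columns(fields):
    """Convert (name, spark_type_string) pairs into the Columns list
    the Glue create_table API expects. For simple types the Spark
    simpleString() names are already Hive/Glue-compatible."""
    return [{"Name": name, "Type": type_string} for name, type_string in fields]

def register_parquet_table(database, table, dataframe, s3_path):
    """Hypothetical helper: register an external Parquet table for
    `dataframe` in the Glue Data Catalog via the boto3 Glue API."""
    import boto3  # imported here; boto3 is available on Glue job workers

    columns = schema_to_glue_columns(
        [(f.name, f.dataType.simpleString()) for f in dataframe.schema.fields]
    )
    boto3.client("glue").create_table(
        DatabaseName=database,
        TableInput={
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet"},
            "StorageDescriptor": {
                "Columns": columns,
                "Location": s3_path,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )
```

This avoids the crawler entirely, at the cost of calling the Glue API from inside the job.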

In Spark on EMR I can create the Glue Data Catalog table pretty easily (although it is not well documented!):

dataframe.write.mode(mode).format("parquet").option("path", parquet_path).saveAsTable(glue_table)
dataframe.write.format("parquet").mode(mode).save(parquet_path)

This doesn't work in Glue. I can set up the Glue Data Catalog as the Hive metastore on the Spark session within the Glue job:

spark = SparkSession.builder \
    .appName(args['JOB_NAME']) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()

but when I try to set the database it says it doesn't exist, and when I list the databases I get the following:

Databases=[Database(name=u'default', description=u'Default Hive database', locationUri=u'hdfs://ip-172-31-29-88.ap-southeast-2.compute.internal:8020/user/spark/warehouse')]

This makes me think that the Glue job isn't using the Glue Data Catalog at all; it appears to be using a default Hive catalog. Am I missing something?

The reason this is an issue is that in EMR I can do things like:

spark.sql("select * from my_glue_table")

This works in EMR, but I suspect it will not work in a Glue job unless I run a crawler first, and I really don't see the need for a crawler when in EMR I can do it with one line of code.

Am I missing something here?

Thanks in advance.

Upvotes: 5

Views: 14401

Answers (3)

Thom Lane

Reputation: 1063

You can use the CREATE TABLE statement in Spark SQL to add the table to the AWS Glue Catalog.

spark.sql("USE database_name")

df.createOrReplaceTempView("df")

spark.sql("""
    CREATE TABLE table_name
    USING CSV
    AS SELECT * FROM df
""")

When writing to CSV, I had to make sure the location URI for the Glue database was set; otherwise I'd get 'Can not create a Path from an empty string' errors, even when setting LOCATION in the query.

When writing to Parquet, it worked by setting LOCATION to an Amazon S3 path.
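The Parquet variant of the pattern above can be sketched as follows; the table, view, and bucket names are placeholders, and the helper just assembles the statement so the LOCATION clause stays explicit:

```python
def create_table_sql(table, source_view, s3_path):
    """Build a Spark SQL CREATE TABLE AS SELECT statement that writes
    Parquet to an explicit S3 LOCATION, so a Glue database with no
    location URI does not cause 'empty string' path errors."""
    return (
        "CREATE TABLE {table} "
        "USING PARQUET "
        "LOCATION '{path}' "
        "AS SELECT * FROM {view}"
    ).format(table=table, path=s3_path, view=source_view)

# Inside the Glue job (placeholder names):
# df.createOrReplaceTempView("df")
# spark.sql("USE my_database")
# spark.sql(create_table_sql("my_table", "df", "s3://my-bucket/my_table/"))
```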

Upvotes: 0

Yuriy Bondaruk

Reputation: 4750

You can create a temp view from a DataFrame and run SQL queries against it:

val dataDf = glueContext.sparkSession.read.format(format).load(path)
// or: val dataDf = dynamicFrame.toDF()
dataDf.createOrReplaceTempView("my_glue_table")
val allDataDf = glueContext.sparkSession.sql("select * from my_glue_table")

To create a table in the Data Catalog, the following code can help:

val table = new com.amazonaws.services.glue.catalog.Table(namespace, tblName, schema,
  partitions, parameters, location, serdeInfo, hiveCompatible)
glueContext.getCatalogClient.createTable(table)

Upvotes: 2
