Reputation: 1
I am using Spark to create a database with a LOCATION parameter pointing to an S3 bucket path, then creating a dataframe and writing it as Delta with saveAsTable. Both the database and the table show up in Glue and point to the S3 location. But when I read with spark.sql('select * from mydb.tbl'), I get Table does not support reads.
When I use spark.read.format('delta').load('s3 path') instead, the data returns. How can I query that Delta table through the Glue catalog without providing the S3 path? I thought the metastore database stores that information, such as where the data files are.
Setup: EMR 6.15.0 with Hadoop 3.3.6, Hive 3.1.4, JupyterEnterpriseGateway 2.6.0, Spark 3.4.1, Livy 0.7.1
Nodes are configured via the classifications below:
spark-hive-site: hive.metastore.client.factory.class: com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
delta-defaults: delta.enabled: true
Code Snippet:
spark = (SparkSession.builder
    .config('spark.jars', '/usr/share/aws/delta/lib/delta-core_2.12-2.4.0.jar')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .enableHiveSupport()
    .getOrCreate())
spark.sql("create database mydb LOCATION 's3://bucket_name/mydb'")
df.write.format('delta').mode('overwrite').saveAsTable('tbl')
At this point the table and all the Delta files were saved to the S3 bucket specified in the create database statement, under a new path in mydb/.
spark.sql('select * from mydb.tbl')
returns Table does not support reads
Are spark.read.format('delta').load('path') or DeltaTable.forPath(spark, 'path') the only ways to query a Delta table, so that I still have to provide the path even though the table is registered in the Glue catalog?
What am I missing here?
Upvotes: 0
Views: 652
Reputation: 698
There are a few things in here that you should look for:
Since you have "delta.enabled":"true", you don't have to mention the jar that you have put in "spark.jars", as it should already be included once you have enabled Delta. Even if you do want to mention jars, you should include the ones listed in the AWS documentation:
"/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar"
Delta tables are v2 table formats, so you can use spark.table("<tablename>") to read the data from any open table format (OTF) table, i.e. Delta, Iceberg and Hudi.
ddf = spark.table("db.delta_table")
ddf.show()
With the DeltaTable PySpark API, you can also use DeltaTable.forName(), which lets you read a table via its table name.
from delta.tables import DeltaTable
delta_tbl = DeltaTable.forName(spark, 'db.delta_table')
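Note that DeltaTable.forName() returns a DeltaTable object rather than a DataFrame; a minimal sketch of converting it for querying, reusing the same handle as above:
# DeltaTable exposes Delta-specific operations (history, merge, vacuum);
# call toDF() to get a plain DataFrame that supports show(), filters, etc.
delta_tbl.toDF().show()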
To start with, I would suggest removing the delta-core jar from spark.jars.
Upvotes: 0
Reputation: 5068
I believe you are missing the following configuration from your spark session:
"spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
I have initialized my Spark session using the following config and am able to query using spark.sql:
"conf": {
"spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
"spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
"spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
and result:
spark.sql("select * from mydb.delta_table;").show()
+---+-------------+--------------------+
| id|creation_date| last_update_time|
+---+-------------+--------------------+
|102| 2015-01-01|2015-01-01T13:51:...|
|101| 2015-01-01|2015-01-01T12:14:...|
|103| 2015-01-01|2015-01-01T13:51:...|
|100| 2015-01-01|2015-01-01T13:51:...|
+---+-------------+--------------------+
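If you build the SparkSession yourself instead of passing a conf block like the one above, the same keys can go through the builder. A minimal sketch, assuming the jar paths shipped with EMR as listed above:
from pyspark.sql import SparkSession

# Same settings as the conf block above, supplied via the builder.
# spark.hadoop.hive.metastore.client.factory.class is the key missing from the
# question's session config; it points the Spark/Hive catalog at Glue.
spark = (SparkSession.builder
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .config('spark.jars', '/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar')
    .config('spark.hadoop.hive.metastore.client.factory.class', 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory')
    .enableHiveSupport()
    .getOrCreate())

spark.sql('select * from mydb.delta_table').show()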
Upvotes: 0