Reputation: 1
I am using Spark to create a database with a LOCATION parameter pointing to an S3 bucket path, then creating a dataframe and writing it as Delta with saveAsTable. Both the database and the table show up in Glue and point to the S3 location. But when I read with spark.sql('select * from mydb.tbl'), I get Table does not support reads.
When I use spark.read.format('delta').load('s3 path') instead, the data returns. How can I query that Delta table through the Glue catalog without providing the S3 path? I thought the metastore database stores that information, such as where the data files are.
Setup: EMR 6.15.0 with Hadoop 3.3.6, Hive 3.1.4, JupyterEnterpriseGateway 2.6.0, Spark 3.4.1, Livy 0.7.1
Nodes are configured via the classifications below:
spark-hive-site: hive.metastore.client.factory.class: com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
delta-defaults: delta.enabled: true
Code Snippet:
spark = (SparkSession.builder
    .config('spark.jars', '/usr/share/aws/delta/lib/delta-core_2.12-2.4.0.jar')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .enableHiveSupport()
    .getOrCreate())
spark.sql("create database mydb LOCATION 's3://bucket_name/mydb'")
df.write.format('delta').mode('overwrite').saveAsTable('tbl')
At this point the table and all the Delta files were saved to the S3 bucket specified in the create database statement, under a new path in mydb/.
spark.sql('select * from mydb.tbl')
returns Table does not support reads
Are spark.read.format('delta').load('path') or DeltaTable.forPath(spark, 'path') the only ways to query a Delta table, so that I still have to provide the path even though the table is registered in the Glue catalog?
What am I missing here?
Upvotes: 0
Views: 652
Reputation: 698
There are a few things in here that you should look for:
Since you have "delta.enabled":"true", you don't have to mention the jar that you have put in "spark.jars", as it should already be included once you have enabled Delta. Even if you do want to mention jars, you should include the ones listed in the AWS documentation:
"/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar"
Delta tables are v2 table formats, so you can use spark.table("<tablename>") to read the data from any open table format (OTF) table, i.e. Delta, Iceberg and Hudi.
ddf = spark.table("db.delta_table")
ddf.show()
With the DeltaTable PySpark API, you can also use DeltaTable.forName(), which lets you read a table via its table name.
from delta.tables import DeltaTable
delta_tbl = DeltaTable.forName(spark, 'db.delta_table')
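Note that DeltaTable.forName() returns a DeltaTable object rather than a DataFrame; a minimal sketch of converting it for querying, reusing the same handle as above:
# DeltaTable exposes Delta-specific operations (history, merge, vacuum);
# call toDF() to get a plain DataFrame that supports show(), filters, etc.
delta_tbl.toDF().show()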
To start with, I would suggest removing the delta-core jar from spark.jars.
Upvotes: 0
Reputation: 5068
I believe you are missing the following configuration from your spark session:
"spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
I have initialized my Spark session using the following config and am able to query using spark.sql:
"conf": {
"spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
"spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
"spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
and result:
spark.sql("select * from mydb.delta_table;").show()
+---+-------------+--------------------+
| id|creation_date| last_update_time|
+---+-------------+--------------------+
|102| 2015-01-01|2015-01-01T13:51:...|
|101| 2015-01-01|2015-01-01T12:14:...|
|103| 2015-01-01|2015-01-01T13:51:...|
|100| 2015-01-01|2015-01-01T13:51:...|
+---+-------------+--------------------+
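If you build the SparkSession yourself instead of passing a conf block like the one above, the same keys can go through the builder. A minimal sketch, assuming the jar paths shipped with EMR as listed above:
from pyspark.sql import SparkSession

# Same settings as the conf block above, supplied via the builder.
# spark.hadoop.hive.metastore.client.factory.class is the key missing from the
# question's session config; it points the Spark/Hive catalog at Glue.
spark = (SparkSession.builder
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .config('spark.jars', '/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar')
    .config('spark.hadoop.hive.metastore.client.factory.class', 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory')
    .enableHiveSupport()
    .getOrCreate())

spark.sql('select * from mydb.delta_table').show()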
Upvotes: 0