Thelight

Reputation: 369

Delta Lake Table metadata

Where does Delta Lake store the table metadata info? I am using Spark 2.6 (not Databricks) on my standalone machine. My assumption was that if I restart Spark, the table created in Delta Lake would be dropped (I am trying this from a Jupyter notebook), but that is not the case.

Upvotes: 5

Views: 10490

Answers (2)

Smalltalkguy

Reputation: 399

Delta stores the metadata in the _delta_log folder, located in the same directory as the table's data. The table can also be registered in Hive, but that depends on the configuration.
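As a minimal sketch of where that folder lands (assuming the delta-spark package is installed and configured; the /tmp/delta/events path is hypothetical):

    import os
    from pyspark.sql import SparkSession

    # Configs needed to enable Delta Lake on open-source Spark
    spark = (SparkSession.builder
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "/tmp/delta/events"  # hypothetical table location
    spark.range(5).write.format("delta").mode("overwrite").save(path)

    # The transaction log sits next to the data files:
    print(os.listdir(os.path.join(path, "_delta_log")))
    # e.g. ['00000000000000000000.json']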

For more information, please read this paper: https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf

Upvotes: 0

zsxwing

Reputation: 20826

There are two types of tables in Apache Spark: external tables and managed tables. When a table is created using the LOCATION keyword in the CREATE TABLE statement, it is an external table. Otherwise, it is a managed table, and its location is under the directory specified by the Spark SQL conf spark.sql.warehouse.dir, whose default value is the spark-warehouse directory in the current working directory.
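For example, a sketch of both forms (the table names and path are hypothetical; USING delta assumes the Delta package is configured as above):

    # Managed table: data goes under spark.sql.warehouse.dir
    # (default: the spark-warehouse directory in the current working directory)
    spark.sql("CREATE TABLE managed_events (id BIGINT) USING delta")

    # External table: the LOCATION keyword points Spark at a path you manage
    spark.sql("""
        CREATE TABLE external_events (id BIGINT)
        USING delta
        LOCATION '/tmp/delta/events'
    """)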

Besides the data, Spark also needs to store the table metadata in the Hive Metastore, so that it knows where the data is when a user queries by table name. The Hive Metastore is usually a database. If a user doesn't specify a database for the Hive Metastore, Spark will use an embedded database called Derby to store the table metadata on the local file system.
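On a standalone machine this is easy to observe. A sketch, assuming no external metastore is configured (the paths are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  # managed table data
             .enableHiveSupport()  # needs Spark built with Hive support
             .getOrCreate())

    # With no metastore database configured, Spark lets embedded Derby
    # create a metastore_db directory in the current working directory;
    # that directory is what makes tables survive a Spark restart.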

The DROP TABLE command behaves differently depending on the table type. When a table is a managed table, DROP TABLE removes the table from the Hive Metastore and deletes the data. If the table is an external table, DROP TABLE removes the table from the Hive Metastore but keeps the data on the file system. Hence, the data files of an external table need to be deleted from the file system manually by the user.
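A quick way to see the difference, continuing the hypothetical tables from the sketch above:

    import os

    spark.sql("DROP TABLE managed_events")   # removes metadata and deletes the data
    spark.sql("DROP TABLE external_events")  # removes metadata, keeps the data
    print(os.path.exists("/tmp/delta/events"))  # True: external data is still on disk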

Upvotes: 11
