Joshua Stafford

Reputation: 635

How can you find the size of a delta table quickly and accurately?

The microsoft documentation here: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/find-size-of-table#size-of-a-delta-table suggests two methods:

Method 1:

import com.databricks.sql.transaction.tahoe._
val deltaLog = DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")
val snapshot = deltaLog.snapshot               // the current delta table snapshot
println(s"Total file size (bytes): ${deltaLog.snapshot.sizeInBytes}")`

Method 2:

spark.read.table("<non-delta-table-name>").queryExecution.analyzed.stats

For my table, they both return ~300 MB.

But in Storage Explorer's folder statistics, or with a recursive dbutils.fs.ls walk, I get ~900 MB.

So the two methods that are much quicker than literally walking every file underreport by 67%. Falling back on the slower methods would be fine, except that when I try to scale up to the entire container, it takes 55 hours to scan all 1 billion files and 2.6 PB.

So what is the best way to get the size of a table in ADLS Gen 2? Bonus points if it works for folders that are not tables, as that's really the number I need. dbutils.fs.ls only runs on the driver, so it's not very parallelizable either: it can be threaded, but only within the driver.
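
For concreteness, the recursive walk I mean is essentially the following (a minimal Python sketch; the placeholder path stands in for the real folder):

def dir_size(path):
    # recursively sum file sizes under a path via dbutils.fs.ls (driver-only)
    total = 0
    for fi in dbutils.fs.ls(path):
        total += dir_size(fi.path) if fi.isDir() else fi.size
    return total

print(f"Total size (bytes): {dir_size('dbfs:/<path-to-delta-table>')}")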

Upvotes: 7

Views: 13756

Answers (3)

Anupam Chand

Reputation: 2687

To get the size of the current version of the table, you can use

byte_size = spark.sql("describe detail customers").select("sizeInBytes").collect()
byte_size = byte_size[0]["sizeInBytes"]
kb_size = byte_size / 1024
mb_size = kb_size / 1024
gb_size = mb_size / 1024
tb_size = gb_size / 1024

print(f"Current table snapshot size is {byte_size} bytes or {kb_size} KB or {mb_size} MB or {gb_size} GB or {tb_size} TB")

To get the size of the table including all the historical files/versions which have not been vacuumed yet, you can run the following command and refer to the Statistics row:

DESCRIBE EXTENDED customers

So, to extract the table size you can use:

spark.sql("DESCRIBE EXTENDED customers").createOrReplaceTempView("stats_table")
stats = spark.sql("select data_type from stats_table where col_name = 'Statistics'").collect()
print(stats[0]["data_type"])

This will give the same value as manually iterating through all the files, and the code snippet above should be much faster than walking every file and subdirectory in ADLS. All you need to do is feed in a list of tables. You can even extract the list of tables using the code snippet below; a rough end-to-end sketch follows after it.

#get list of all tables
catalog_name = "spark_catalog"
schema_list = spark.sql(f"show schemas in {catalog_name}").select('databaseName').rdd.map(lambda x : x[0]).collect()
for database_name in schema_list:
    table_list = spark.sql(f"show tables from {database_name}").select('tableName').rdd.map(lambda x : x[0]).collect()
    for table in table_list:
        print(f"{database_name}.{table}")

Can you have a go at this with your large tables and share the performance?

Upvotes: 0

Mahesh Malpani

Reputation: 1999

Got the answer from https://community.databricks.com/s/question/0D58Y00009MIRSDSA5/how-to-find-the-size-of-a-table-in-python-or-sql

%python
spark.sql("describe detail schema.delta_table_name").select("sizeInBytes").collect()

Upvotes: -1

Kombajn zbożowy

Reputation: 10703

deltaLog.snapshot returns just the current snapshot. You can have more files present in the table's directory; those belong to historical versions whose data has since been deleted or replaced and is no longer part of the current snapshot.

Also it returns 0 without complaints for non-delta paths. So I'm using this piece of code to get a database-level summary:

import com.databricks.sql.transaction.tahoe._
 
val databasePath = "dbfs:/<path-to-database>"
 
def size(path: String): Long =
  // recursively sum the sizes of all files under a path
  dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum
 
val tables = dbutils.fs.ls(databasePath).par.map { fi =>
  val totalSize = size(fi.path)  // everything on disk, including historical versions
  val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes  // current version only
  (fi.name, totalSize / 1024 / 1024 / 1024, snapshotSize / 1024 / 1024 / 1024)
}
display(tables.seq.sorted.toDF("name", "total_size_gb", "snapshot_size_gb"))

This parallelizes on the driver only, but since it's just file listing it's still pretty fast. I admit I don't have a billion files, but if it's slow for you, just use a bigger driver and tune the number of threads.
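
If you prefer Python, a rough equivalent with an explicitly tunable thread count could look like the sketch below (untested at that scale; dbutils.fs calls from driver-side threads generally work, but mileage may vary by runtime):

from concurrent.futures import ThreadPoolExecutor

def folder_size(path):
    # recursively sum file sizes under a path
    return sum(folder_size(fi.path) if fi.isDir() else fi.size for fi in dbutils.fs.ls(path))

database_path = "dbfs:/<path-to-database>"
table_dirs = dbutils.fs.ls(database_path)

with ThreadPoolExecutor(max_workers=32) as pool:  # the thread count is the knob to tune
    sizes = list(pool.map(lambda fi: folder_size(fi.path), table_dirs))

for fi, size in zip(table_dirs, sizes):
    print(f"{fi.name}: {size / 1024**3:.1f} GB")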

Upvotes: 1
