yvdw
yvdw

Reputation: 31

Delta table statistics

the log of a delta table stores metadata about the transactions and about statistics (data type, min, max, nr. columns etc). However, I can only see the data types when looking into the json file of this log. Does anyone know how to obtain the min, max and nr. columns of this delta table without computing anything (since the delta table should have this information when reading the file)?

Upvotes: 3

Views: 1399

Answers (2)

Jim Hibbard
Jim Hibbard

Reputation: 365

With the release of data skipping in Delta Lake 1.2.0, column-level statistics like min/max are now available. Statistics are saved in the Delta Lake transaction log every time an add action is performed corresponding to adding a new parquet file. Using the delta-rs Python package (v0.8.0 or newer) we can retrieve and read these add action records from the transaction log to view the file-level statistics. First, install the package with pip or conda:

# using pip:
$ pip install deltalake

# using conda:
$ conda install -c conda-forge delta-spark

We'll create a dummy Delta Table using pandas and delta-rs. Then we'll read the add action records using the get_add_actions method:

from deltalake import DeltaTable, write_deltalake
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
write_deltalake("tmp", df, partition_by=["x"])
dt = DeltaTable("tmp")
dt.get_add_actions().to_pandas()

You'll get the following output (exported as tsv for inclusion):

    path    size_bytes  modification_time   data_change partition_values    num_records null_count  min max
0   x=1/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet    1867    2023-03-28 23:28:25.634 True    {'x': 1}    1   {'y': 0}    {'y': 4}    {'y': 4}
1   x=2/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet    1867    2023-03-28 23:28:25.634 True    {'x': 2}    1   {'y': 0}    {'y': 5}    {'y': 5}
2   x=3/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet    1867    2023-03-28 23:28:25.634 True    {'x': 3}    1   {'y': 0}    {'y': 6}    {'y': 6}

note: each statistic will have its own column in the DataFrame, e.g. num_records, null_count, min, and max.

Upvotes: 3

Alex Ott
Alex Ott

Reputation: 87259

This depends on if you are using open source version or Databricks version. The former don’t have this functionality, it exists only in DB version

Upvotes: 0

Related Questions