Reputation: 31
The log of a Delta table stores metadata about the transactions and also statistics (data types, min/max values, number of columns, etc.). However, I can only see the data types when looking into the JSON files of this log. Does anyone know how to obtain the min, max, and number of columns of a Delta table without computing anything (since the Delta table should already have this information when reading the file)?
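For context, here is how far I got by inspecting the raw log myself (a minimal sketch; tmp is a placeholder path for the table, and the stats field of each add action is stored as a JSON string):
import json
from pathlib import Path

# Read the first transaction log entry; Delta log files are JSON Lines.
log_file = Path("tmp/_delta_log/00000000000000000000.json")
for line in log_file.read_text().splitlines():
    action = json.loads(line)
    if "add" in action:
        # "stats" holds numRecords, minValues, maxValues, nullCount.
        stats = json.loads(action["add"]["stats"])
        print(action["add"]["path"], stats)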
Upvotes: 3
Views: 1399
Reputation: 365
With the release of data skipping in Delta Lake 1.2.0, column-level statistics like min/max are now available. Statistics are saved in the Delta Lake transaction log every time an add action is performed, corresponding to adding a new Parquet file. Using the delta-rs Python package (v0.8.0 or newer), we can retrieve and read these add action records from the transaction log to view the file-level statistics. First, install the package with pip or conda:
# using pip:
$ pip install deltalake
# using conda:
$ conda install -c conda-forge deltalake
We'll create a dummy Delta table using pandas and delta-rs, then read the add action records using the get_add_actions method:
from deltalake import DeltaTable, write_deltalake
import pandas as pd

# Write a small partitioned Delta table, then load it back.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
write_deltalake("tmp", df, partition_by=["x"])
dt = DeltaTable("tmp")

# Each row of the result is one add action (one Parquet file) with its stats.
dt.get_add_actions().to_pandas()
You'll get the following output (shown here as TSV for inclusion):
path size_bytes modification_time data_change partition_values num_records null_count min max
0 x=1/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet 1867 2023-03-28 23:28:25.634 True {'x': 1} 1 {'y': 0} {'y': 4} {'y': 4}
1 x=2/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet 1867 2023-03-28 23:28:25.634 True {'x': 2} 1 {'y': 0} {'y': 5} {'y': 5}
2 x=3/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet 1867 2023-03-28 23:28:25.634 True {'x': 3} 1 {'y': 0} {'y': 6} {'y': 6}
Note: each statistic gets its own column in the DataFrame, e.g. num_records, null_count, min, and max.
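To also answer the number-of-columns part of the question without scanning any data, the schema is available from the table metadata, and the per-file min/max stats can be reduced to table-level values. A short sketch against the dummy table above (column y comes from the example; x carries no stats because it is a partition column, and the dict access assumes the default, non-flattened output of get_add_actions):
# Number of columns straight from the table schema (no data scan):
print(len(dt.schema().fields))  # -> 2 (x and y)

# Table-level min/max for column "y", reduced from the per-file stats:
stats = dt.get_add_actions().to_pandas()
print(stats["min"].apply(lambda s: s["y"]).min())  # -> 4
print(stats["max"].apply(lambda s: s["y"]).max())  # -> 6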
Upvotes: 3
Reputation: 87259
This depends on whether you are using the open-source version or the Databricks version. The former doesn't have this functionality; it exists only in the Databricks version.
Upvotes: 0