Reputation: 10703
I would like to check how each column of parquet data contributes to total file size / total table size.
I looked through Spark/Databricks commands, parquet-cli and parquet-tools, and unfortunately it seems that none of them provides this information directly. Given that this is a columnar format, it should be possible to pull it out somehow.
So far the closest I have got is to run parquet-tools meta, sum up the details by column for each row group within the file, and then aggregate this across all files of the table. This means iterating over all parquet files and cumbersome parsing of the output.
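In other words, the aggregation amounts to something like this (sketched here with pyarrow instead of parsing the text output; the path pattern is made up):

import collections
import glob

import pyarrow.parquet as pq

# Sum the compressed size of every column chunk, per column, across all files
sizes = collections.defaultdict(int)
for path in glob.glob("/data/my_table/*.parquet"):  # illustrative path
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            sizes[chunk.path_in_schema] += chunk.total_compressed_size

# Print columns from largest to smallest contribution
for name, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(name, nbytes)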
Maybe there is an easier way?
Upvotes: 1
Views: 1731
Reputation: 8804
Your approach is correct. Here is a Python script using DuckDB to find the overall compressed and uncompressed size of all the columns in a parquet dataset.
import duckdb

con = duckdb.connect(database=':memory:')

# parquet_metadata() returns one row per column chunk per row group;
# aggregating by path_in_schema gives per-column totals for the whole dataset.
print(con.execute("""
    SELECT SUM(total_compressed_size) AS total_compressed_size_in_bytes,
           SUM(total_uncompressed_size) AS total_uncompressed_size_in_bytes,
           path_in_schema AS column_name
    FROM parquet_metadata('D:\\dev\\tmp\\parq_dataset\\*')
    GROUP BY path_in_schema
""").df())
Here parq_dataset (the D:\\dev\\tmp\\parq_dataset\\* glob in the query) consists of multiple parquet files with the same schema. Something similar should be possible using other libraries like pyarrow or fastparquet as well.
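If you want each column's contribution as a share of the total table size directly, the same metadata can be post-processed in SQL. A minimal sketch on the same dataset path (the CTE and alias names are only illustrative):

import duckdb

con = duckdb.connect(database=':memory:')
query = """
    WITH per_column AS (
        SELECT path_in_schema AS column_name,
               SUM(total_compressed_size) AS compressed_bytes
        FROM parquet_metadata('D:\\dev\\tmp\\parq_dataset\\*')
        GROUP BY path_in_schema
    )
    SELECT column_name,
           compressed_bytes,
           -- share of the table's total compressed size, heaviest columns first
           ROUND(100.0 * compressed_bytes / SUM(compressed_bytes) OVER (), 2) AS pct_of_table
    FROM per_column
    ORDER BY compressed_bytes DESC
"""
print(con.execute(query).df())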
Upvotes: 3