user3169506

Reputation: 93

Parquet: difference between metadata and common_metadata

When writing a metadata file, the ThriftParquetWriter actually generates two files: _metadata and _common_metadata.

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java

What's the difference between these two files? They have different file sizes, so their contents must differ somehow.

Thanks

Upvotes: 4

Views: 3682

Answers (2)

Paul Chambre

Reputation: 51

This does not appear to be the case. I am seeing _common_metadata only in hierarchical (partitioned) datasets, where some columns are encoded as directory names. The _common_metadata file contains the schema for the whole table, including those hierarchical columns. The _metadata file contains the schema used for the part files (omitting the hierarchical columns) and also includes per-file column stats (min, max, etc.) for all the files, with their complete relative path names.
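If you want to check where the size difference comes from, here is a minimal sketch using only the Python standard library. It relies on the general Parquet file layout (a Thrift-encoded footer, followed by a 4-byte little-endian footer length and the magic bytes PAR1), and it assumes both summary files follow that layout. The helper names footer_length and compare_footers are my own, not part of any Parquet library.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # magic bytes that end an unencrypted Parquet file


def footer_length(path):
    """Return the size in bytes of the Thrift-encoded footer of a Parquet file."""
    path = Path(path)
    # Smallest possible file: leading magic + footer length + trailing magic.
    if path.stat().st_size < 12:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, 2)  # last 8 bytes: 4-byte footer length + 4-byte magic
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length


def compare_footers(directory, names=("_metadata", "_common_metadata")):
    """Map each summary file found in `directory` to its footer length."""
    directory = Path(directory)
    return {
        name: footer_length(directory / name)
        for name in names
        if (directory / name).exists()
    }
```

Calling compare_footers on your table directory returns a dict of footer sizes. If _metadata carries the per-file stats described above, I would expect its footer to be the larger of the two, growing with the number of part files.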

Upvotes: 0

James Tobin

Reputation: 3110

Looking at the source code at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java, it seems to me that:

_common_metadata contains the merged schemas of the Parquet files in that directory.

_metadata contains only the schema of the most recently written Parquet file in that directory.

Upvotes: 4
