Reputation: 93
When writing a metadata file, the ThriftParquetWriter actually generates two files: _metadata and _common_metadata.
What's the difference between these two files? They have different file sizes, so there must be a difference.
Thanks
Upvotes: 4
Views: 3682
Reputation: 51
This does not appear to be the case. I see _common_metadata only in hierarchical sets (where some columns are encoded as directory names). The _common_metadata file contains the schema for the whole table, including those hierarchical columns, while _metadata contains the schema used for the part files (omitting the hierarchical columns) and also includes per-file column stats (min, max, etc.) for all the files, with their complete relative path names.
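If you want to check the size difference without a Parquet library, here is a rough sketch in plain Python. It assumes both summary files follow the standard Parquet layout, where the file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1. The function name footer_length is just something I made up for illustration; it is not part of any Parquet API.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # trailing magic bytes of an unencrypted Parquet file


def footer_length(path):
    """Return the size in bytes of the serialized footer metadata.

    Assumes the standard trailer: <footer bytes><4-byte LE length><PAR1>.
    """
    size = Path(path).stat().st_size
    # Smallest possible file: leading magic + length field + trailing magic.
    if size < 12:
        raise ValueError(f"{path}: too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, 2)  # last 8 bytes: footer length + magic
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError(f"{path}: missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Calling footer_length on _metadata and on _common_metadata in the same directory lets you compare the two footers directly. A noticeably larger footer in _metadata would be consistent with it carrying per-file stats and paths on top of the schema, but this only measures size; it does not decode the contents.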
Upvotes: 0
Reputation: 3110
Looking at the source code at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java, it seems to me that:
_common_metadata contains the merged schemas for the Parquet files in that directory.
_metadata will contain only the schema of the most recently written Parquet file in that directory.
Upvotes: 4