user976850

Reputation: 1106

Which levels does a Parquet file store min/max/distinct (etc.) statistics on?

I know a Parquet file stores column statistics on the column level inside each Row Group, to allow more efficient queries on top of the data.

Does it also store column statistics on the file level (to avoid reading entire files unnecessarily)? How about the column page level?

Upvotes: 2

Views: 5879

Answers (1)

Zoltan

Reputation: 3105

Parquet does store min/max statistics for row groups, but they live in the file footer rather than inside the row groups themselves. As a result, if none of the row groups match, nothing other than the footer needs to be read. Separate file-level min/max statistics are not needed for this: the row-group-level statistics already solve the problem, since row groups are generally large.
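To make the pruning logic concrete, here is a toy model in plain Python. It is not the Parquet API: `RowGroupStats`, `row_groups_to_read`, and the sample values are all made up for illustration, and the footer is just a list of per-row-group min/max pairs for one column.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    # Hypothetical stand-in for the min/max of one column in one row group,
    # as it would be recorded in the file footer.
    index: int
    min_value: int
    max_value: int


def row_groups_to_read(footer, lo, hi):
    """Return the indexes of row groups whose [min, max] overlaps [lo, hi]."""
    return [
        rg.index
        for rg in footer
        if rg.max_value >= lo and rg.min_value <= hi
    ]


# Made-up footer describing three row groups of a single column.
footer = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 201, 300),
]

# Only row group 1 can contain values between 150 and 160.
print(row_groups_to_read(footer, 150, 160))

# No row group overlaps this range, so only the footer would be read.
print(row_groups_to_read(footer, 500, 600))
```

When the second call returns an empty list, a reader working this way would skip the whole file after looking at the footer, which is why a separate file-level statistic adds little.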

Page-level min/max statistics exist as well, but they are called column indexes and are only implemented in the unreleased 1.11.0 release candidate of parquet-mr. They are somewhat more complicated than row-group-level min/max statistics because row boundaries are not aligned with page boundaries, so extra data structures are needed to find the corresponding values in all requested columns. In any case, this feature allows pinpointing the page-level location of data and radically improves the performance of highly selective queries.
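The alignment problem can be sketched with another toy model in plain Python. Again, this is not the real column index format: `Page`, `matching_row_ranges`, `pages_covering`, and the numbers are hypothetical. The idea is that the filtered column yields row ranges, and those ranges are then mapped onto the pages of another column whose page boundaries differ.

```python
from dataclasses import dataclass


@dataclass
class Page:
    first_row: int      # index of the first row stored in this page
    num_rows: int       # number of rows stored in this page
    min_value: int = 0  # page-level minimum (only used for the filtered column)
    max_value: int = 0  # page-level maximum (only used for the filtered column)


def matching_row_ranges(pages, lo, hi):
    """Half-open row ranges of pages whose [min, max] overlaps [lo, hi]."""
    return [
        (p.first_row, p.first_row + p.num_rows)
        for p in pages
        if p.max_value >= lo and p.min_value <= hi
    ]


def pages_covering(pages, row_ranges):
    """Indexes of pages that contain at least one row from row_ranges."""
    selected = []
    for i, p in enumerate(pages):
        start, end = p.first_row, p.first_row + p.num_rows
        if any(start < r_end and r_start < end for r_start, r_end in row_ranges):
            selected.append(i)
    return selected


# Column A is filtered; its pages hold 100 rows each.
col_a = [Page(0, 100, 1, 50), Page(100, 100, 51, 90), Page(200, 100, 91, 120)]

# Column B is only projected; its pages hold 80, 80, 80 and 60 rows,
# so its page boundaries do not line up with those of column A.
col_b = [Page(0, 80), Page(80, 80), Page(160, 80), Page(240, 60)]

ranges = matching_row_ranges(col_a, 60, 70)
print(ranges)                         # rows 100..199 may match
print(pages_covering(col_b, ranges))  # pages of column B holding those rows
```

In this example a single matching page of column A translates into two pages of column B, which is the kind of bookkeeping the extra data structures are there for.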

Upvotes: 6
