Is there a way in pyarrow to query the values of parquet dataset partitions?

Question

For example, I have a dataset that looks like this:

dataset
    ├── a=1
    │    └── 1.parquet
    ├── a=2
    │    └── 2.parquet
    └── a=3
         └── 3.parquet

and it is loaded with dataset = pyarrow.parquet.ParquetDataset('./dataset'). How do I query the available values of partition "a" without reading the whole dataset into memory? Thanks~

Wes McKinney · Accepted Answer

See the pieces attribute of ParquetDataset. The partition_keys attribute of each ParquetDatasetPiece will give you the value of each partition key. If you have ideas about an API to make this simpler, please open a JIRA issue in Apache Arrow.

See also https://issues.apache.org/jira/browse/ARROW-1956 about reading specific portions of a partitioned dataset.
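Because the partition values are encoded in the directory names, they can also be recovered from file paths alone, without opening any Parquet file. The following is a minimal sketch of that idea, not the pyarrow API described above: partition_values is a hypothetical helper, the paths are made up to mirror the layout in the question, and it assumes Hive-style key=value directories with forward-slash separators. Values come back as strings, since no schema is consulted.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_values(paths):
    """Collect Hive-style ``key=value`` directory segments from file paths.

    Hypothetical helper: it only inspects path strings and never reads
    any Parquet data.
    """
    values = defaultdict(set)
    for path in paths:
        # Only directory components are inspected; the file name is skipped.
        for segment in PurePosixPath(path).parent.parts:
            key, sep, value = segment.partition("=")
            if sep and key:
                values[key].add(value)
    # Sort for a stable result; values stay as strings.
    return {key: sorted(vals) for key, vals in values.items()}


# Made-up paths mirroring the layout in the question.
paths = [
    "dataset/a=1/1.parquet",
    "dataset/a=2/2.parquet",
    "dataset/a=3/3.parquet",
]
print(partition_values(paths))  # {'a': ['1', '2', '3']}
```

For a dataset on local disk, the path list could be built with the standard library, for example [p.as_posix() for p in pathlib.Path('./dataset').rglob('*.parquet')], which only lists the directory tree and loads nothing into memory.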
