Reputation: 475
For example, I have a dataset that looks like this:
dataset
├── a=1
│   └── 1.parquet
├── a=2
│   └── 2.parquet
└── a=3
    └── 3.parquet
and it's loaded in as dataset = pyarrow.parquet.ParquetDataset('./dataset')
How do I query the available entries of partition "a" without reading the whole dataset into memory? Thanks~
Upvotes: 2
Views: 1506
Reputation: 105501
See the pieces attribute of ParquetDataset. The partition_keys attribute of each ParquetDatasetPiece will give you the value of each partition key. If you have ideas about an API to make this simpler, please open a JIRA issue in Apache Arrow.
See also https://issues.apache.org/jira/browse/ARROW-1956 about reading specific portions of a partitioned dataset.
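If the pieces attribute is not available in your pyarrow version, a rough fallback is to read the partition values straight from the directory names, since a hive-style layout encodes them as key=value. The sketch below does not use the pyarrow API at all; list_partition_values is a hypothetical helper name, and it assumes every partition directory is named exactly key=value. No Parquet file is opened, so nothing is loaded into memory.

```python
import tempfile
from pathlib import Path


def list_partition_values(root, key):
    """Return the sorted values of a hive-style partition key under root.

    Hypothetical helper: only directory names are inspected,
    no Parquet file is opened.
    """
    prefix = key + "="
    values = set()
    # Match directories named "<key>=<value>" at any depth below root.
    for path in Path(root).rglob(prefix + "*"):
        if path.is_dir():
            values.add(path.name[len(prefix):])
    return sorted(values)


# Build an empty stand-in for the layout in the question and query it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    for i in (1, 2, 3):
        part_dir = root / f"a={i}"
        part_dir.mkdir(parents=True)
        (part_dir / f"{i}.parquet").touch()  # placeholder, not a real Parquet file
    values = list_partition_values(root, "a")

print(values)  # ['1', '2', '3']
```

The values come back as strings because they are taken from the path names; convert them yourself if you need integers.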
Upvotes: 3