Reputation: 23
I have a parquet file in S3 to which I will be automatically appending additional data every week. The data has timestamps at 5-minute intervals. I do not want to append any duplicate data during my updates, so what I am trying to accomplish is to read ONLY the max/newest timestamp within the data already saved in S3. Then I will make sure that all of the timestamps in the data I am about to append are newer than that time before appending. I don't want to read the entire dataset from S3, in order to keep things fast and preserve memory as the dataset continues to grow.
Here is an example of what I am doing now to read the entire file:
from pyarrow import fs
import pyarrow.parquet as pq

# from_uri() parses the S3 URI into a (filesystem, path) pair
s3, path = fs.S3FileSystem(access_key=access_key, secret_key=secret_key).from_uri(uri)
dataset = pq.ParquetDataset(path, filesystem=s3)
table = dataset.read()
But I am looking for something more like this (I am aware this isn't correct, but hopefully it conveys what I am attempting to accomplish):
max_date = pq.ParquetFile(path, filesystem=s3).metadata.row_group(0).column('timestamp').statistics['max']
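And once I do have that max timestamp, the filtering step I have in mind before appending would look roughly like this (new_table here is just a placeholder for the week's new data):

import pyarrow.compute as pc

# Keep only the rows that are strictly newer than what is already stored in S3
mask = pc.greater(new_table['timestamp'], max_date)
rows_to_append = new_table.filter(mask)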
I am pretty new to using both Pyarrow and AWS, so any help would be fantastic (including alternate solutions to my problem I described).
Upvotes: 1
Views: 2821
Reputation: 43817
From a purely pedantic perspective I would phrase the problem statement a little differently: "I have a parquet dataset in S3 and will be appending new parquet files to it on a regular basis". I only mention that because the pyarrow documentation is written with that terminology in mind (e.g. you cannot append to a parquet file with pyarrow, but you can append to a parquet dataset), so the distinction might help with understanding.
The pyarrow datasets API doesn't have any operations to retrieve dataset statistics today (it might not be a bad idea to request the feature as a JIRA). However, it can help a little in finding your fragments. What you have doesn't seem that far off to me.
from pyarrow import fs
import pyarrow.parquet as pq

s3, path = fs.S3FileSystem(access_key=access_key, secret_key=secret_key).from_uri(uri)

# At this point a call will be made to S3 to list all the files
# in the directory 'path'
dataset = pq.ParquetDataset(path, filesystem=s3)

max_timestamp = None
for fragment in dataset.fragments:
    field_index = fragment.physical_schema.get_field_index('timestamp')
    # This will issue a call to S3 to load the fragment's metadata footer
    metadata = fragment.metadata
    for row_group_index in range(metadata.num_row_groups):
        stats = metadata.row_group(row_group_index).column(field_index).statistics
        # Parquet files can be created without statistics
        if stats and stats.has_min_max:
            row_group_max = stats.max
            if max_timestamp is None or row_group_max > max_timestamp:
                max_timestamp = row_group_max
print(f"The maximum timestamp was {max_timestamp}")
I've annotated the places where actual calls to S3 will be made. This will certainly be faster than loading all of the data, but there is still some overhead, and it will grow as you add more files. That overhead can get quite high if you are running outside of the AWS region that hosts the bucket. You could mitigate this by scanning the fragments in parallel, but that is extra work.
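If you do go that route, here is a rough sketch of scanning the fragment metadata with a thread pool. It assumes the dataset object from above; the fragment_max helper and the max_workers value are just illustrative:

from concurrent.futures import ThreadPoolExecutor

def fragment_max(fragment):
    # Reads one parquet footer from S3 and returns the largest
    # 'timestamp' statistic found in that file (or None if no stats)
    field_index = fragment.physical_schema.get_field_index('timestamp')
    metadata = fragment.metadata
    maxes = []
    for row_group_index in range(metadata.num_row_groups):
        stats = metadata.row_group(row_group_index).column(field_index).statistics
        if stats and stats.has_min_max:
            maxes.append(stats.max)
    return max(maxes) if maxes else None

with ThreadPoolExecutor(max_workers=8) as pool:
    per_fragment_maxes = [m for m in pool.map(fragment_max, dataset.fragments) if m is not None]

max_timestamp = max(per_fragment_maxes) if per_fragment_maxes else None
print(f"The maximum timestamp was {max_timestamp}")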
It would be faster to store the max_timestamp in a dedicated statistics file whenever you update the data in your dataset. That way there is only ever one small file you need to read (sketched below). If you're managing the writes yourself you might look into a table format like Apache Iceberg, which is a standard format for storing this kind of extra information and statistics about a dataset (what Arrow calls a "dataset" Iceberg calls a "table").
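For example, a rough sketch of that statistics-file idea using a small JSON object stored alongside the dataset with the same S3 filesystem (the _stats.json name and the helper functions are just illustrative):

import json

stats_path = path.rstrip('/') + '/_stats.json'   # example sidecar key next to the data files

def write_stats(s3, stats_path, max_timestamp):
    # Overwrite the sidecar with the latest known max timestamp after each append
    with s3.open_output_stream(stats_path) as out:
        out.write(json.dumps({'max_timestamp': str(max_timestamp)}).encode('utf-8'))

def read_stats(s3, stats_path):
    # One small GET instead of touching every parquet footer;
    # the value comes back as a string here, so parse it into a timestamp as needed
    with s3.open_input_stream(stats_path) as src:
        return json.loads(src.read().decode('utf-8'))['max_timestamp']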
Upvotes: 1