abc123

Reputation: 587

How to read Parquet results from S3 with pagination

My results are stored in Amazon S3 in Parquet format.

My requirements are as follows:

  1. I have an S3 bucket where I store my results as Parquet (multiple Parquet part files). I want to retrieve the results from all the parts.
  2. I want to retrieve all rows (in all the parts) as they are. (Being able to query them would be nice.)
  3. I want to paginate because my environment is not distributed: a single EC2 instance runs Java code to fetch the results. I need the results paginated so that the EC2 host does not crash while retrieving them.

Options I looked into:

  1. ListObjectsV2Request: I can't use this yet because we have not upgraded to the AWS SDK for Java 2.0.

  2. S3 Select: since S3 Select needs the exact key of the object I want to read, I would first have to list all the parts in S3 and then run S3 Select on each part. I am also not sure how I would paginate the input stream that S3 returns.

  3. "Read parquet data from AWS s3 bucket": I am also looking into this, but I am not clear on how to paginate the results.

Any input/help will be highly appreciated.

Upvotes: 1

Views: 1876

Answers (1)

John Rotenstein

Reputation: 269360

This sounds like an excellent use case for Amazon Athena. It can:

  • Read Parquet files
  • Treat multiple files in a directory as a single source of data
  • Query the data to retrieve only the desired results (it can also JOIN tables)
  • Return paginated results
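
To illustrate the pagination point: Athena's GetQueryResults API accepts a MaxResults value and a NextToken, so the caller only ever holds one page in memory. Below is a minimal sketch of that token-driven loop. It is not the actual SDK code; `PagedResults`, `Page`, `PageSource` and `inMemory` are hypothetical names made up for this example, and in real code `PageSource.fetch` would wrap the SDK call that issues GetQueryResults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PagedResults {

    /** One page of rows plus the token for the next page (null when there are no more pages). */
    static final class Page {
        final List<String> rows;
        final String nextToken;

        Page(List<String> rows, String nextToken) {
            this.rows = rows;
            this.nextToken = nextToken;
        }
    }

    /** Hypothetical stand-in for a paginated call such as Athena's GetQueryResults. */
    interface PageSource {
        Page fetch(String nextToken, int maxResults);
    }

    /** Fetches one page at a time, hands it to the handler, and returns the total row count. */
    static int forEachPage(PageSource source, int maxResults, Consumer<List<String>> handler) {
        int total = 0;
        String token = null; // null means "start from the first page"
        do {
            Page page = source.fetch(token, maxResults);
            handler.accept(page.rows); // only this page is held in memory
            total += page.rows.size();
            token = page.nextToken;
        } while (token != null);
        return total;
    }

    /** In-memory fake, used here only to exercise the loop without calling AWS. */
    static PageSource inMemory(List<String> allRows) {
        return (token, maxResults) -> {
            int start = token == null ? 0 : Integer.parseInt(token);
            int end = Math.min(start + maxResults, allRows.size());
            String next = end < allRows.size() ? String.valueOf(end) : null;
            return new Page(new ArrayList<>(allRows.subList(start, end)), next);
        };
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            rows.add("row-" + i);
        }
        int total = forEachPage(inMemory(rows), 10,
                page -> System.out.println("got " + page.size() + " rows"));
        System.out.println("total " + total);
    }
}
```

Because each page is processed and then dropped before the next one is requested, memory use on the EC2 host is bounded by the page size rather than by the size of the full result set.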

Upvotes: 2
