Sasha Chernin
Sasha Chernin

Reputation: 77

Is there a tool to query Parquet files which are hosted in S3 storage?

I have Paraquet files in my S3 bucket which is not AWS S3.

Is there a tool that connects to any S3 service (like Wasabi, Digital Ocean, MinIO), and allows me to query the Parquet files?

Upvotes: 4

Views: 10509

Answers (3)

Ryan Hamilton
Ryan Hamilton

Reputation: 2615

QStudio is a free SQL client that bundles DuckDB as part of the download: https://www.timestored.com/qstudio/help/duckdb-sql-editor

Go File Menu -> New DuckDB database then use the commands ns15 provided:

-- Highlight and press Control+E to Execute all commands one by one
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='****';
SET s3_secret_access_key='****';

-- Read the data using S3 protocol
select * from read_parquet('s3://test-me-ses/userdata1.parquet');
select * from read_parquet('https://www.timestored.com/data/sample/titanic.parquet');

QStudio Parquet Query

If you install QStudio in windows it adds a parquet association that allows just double clicking to open and query a parquet file: https://www.timestored.com/qstudio/parquet-file-viewer

Disclaimer: I'm the author since 2013.

Upvotes: 0

ns15
ns15

Reputation: 8844

In case you need a GUI tool then you can use DBeaver + DuckDB. For programmatic use, You can find DuckDB library for most languages.

Here is my other answer on the same topic.

There is a slight difference since you are querying data on a S3 compatible storage. You simply need to run few additional commands mentioned here in the DuckDB docs.

-- Execute these commnds one by one or run as a script from DBeaver
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='****';
SET s3_secret_access_key='****';

-- Read the data using S3 protocol
select * from read_parquet('s3://test-me-ses/userdata1.parquet');

In case you have parquet files hosted and served from S3 or any web server via HTTP - DuckDB has this covered as well.

-- Read the data using HTTP protocol if parquet file is hosted
SELECT * FROM read_parquet('https://test-me-ses.s3.amazonaws.com/userdata1.parquet');

Any S3 compatible object store(Wasabi, Digital Ocean, MinIO) should work similarly..

enter image description here

You can also write the data back as parquet after transformation to any S3 compatible storage(AWS, MinIO etc..).

All of these can be done programmatically as well.

Upvotes: 5

Joe Drumgoole
Joe Drumgoole

Reputation: 1348

With MongoDB this can be done with our Atlas Data Federation product https://www.mongodb.com/docs/atlas/data-federation/overview/

It can query parquet files stored in S3.

Upvotes: 0

Related Questions