Henry

Reputation: 139

Query MinIO database without converting the files with Pandas

I would like to know whether there is any way to query a MinIO store that holds DeltaTables in Parquet format.

Currently I am using pyarrow with pandas, but it is really slow once the data gets larger. I saw that PySpark can be used to query the DeltaTables, but I would like to know if there are any other options.

Thanks

Upvotes: 0

Views: 579

Answers (1)

Pedro Duran

Reputation: 11

It depends on the scale of the data you are dealing with. For big enough data sets you could try Presto, which lets you run SQL queries against Parquet files stored in MinIO through the Hive connector. Here is a how-to:

https://blog.min.io/interactive-sql-query-with-presto-on-minio-cloud-storage/

Also, once the dataset gets large, you can take advantage of the Hive partition folder naming convention (e.g. s3://bucketname/year=2019/) to reduce the amount of data each query has to scan. See the Hive connector documentation on partitioning.
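To illustrate why that naming convention helps, here is a minimal sketch in plain Python (standard library only) of how the key=value folder names let a reader skip whole folders before opening any file. It is not how Presto or the Hive connector is implemented; the helper names `parse_partitions` and `prune` and the object keys are made up for the example.

```python
def parse_partitions(key):
    """Extract Hive-style key=value folder segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(n) == v for n, v in wanted.items())
    ]


# Hypothetical object keys, laid out like s3://bucketname/year=2019/
keys = [
    "bucketname/year=2018/part-0000.parquet",
    "bucketname/year=2019/part-0000.parquet",
    "bucketname/year=2019/part-0001.parquet",
]

# Only the year=2019 files would need to be read for this filter.
selected = prune(keys, year="2019")
```

A query engine that understands the convention does the same kind of filtering on the folder names, so a query with WHERE year = 2019 never touches the year=2018 files.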

Unrelated note: credit to this question for helping me remember the name of the convention.

Upvotes: 1
