Reputation: 139
I would like to know whether there is any way to query a MinIO instance that stores Delta tables in Parquet format.
Currently I am using pyarrow with pandas, but it gets really slow as the data grows. I saw that PySpark can be used to query Delta tables, but I would like to know if there are any other options.
Thanks
Upvotes: 0
Views: 579
Reputation: 11
It depends on the scale of the data you are dealing with. For large enough data sets you could try Presto, which lets you run SQL queries against Parquet files stored in MinIO through its Hive connector. Here is a how-to:
https://blog.min.io/interactive-sql-query-with-presto-on-minio-cloud-storage/
Also, once the dataset gets large, you can take advantage of the Hive partition folder naming convention (e.g. s3://bucketname/year=2019/) to reduce the amount of data that has to be scanned for a query. The Hive connector docs cover partitioning.
Unrelated note: credit to this question for helping me remember the name of the convention.
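To show why that folder naming helps, here is a minimal Python sketch of the idea behind partition pruning: the partition values are read straight from the object key, so files that cannot match the filter are skipped without being opened. This is only an illustration, not how Presto or the Hive connector is implemented; the object keys and the `partition_values` and `prune` helpers are hypothetical names made up for the example.

```python
from pathlib import PurePosixPath


def partition_values(key):
    """Return the Hive-style key=value folder segments of an object key as a dict."""
    values = {}
    # Only the folders are inspected, not the file name itself.
    for segment in PurePosixPath(key).parent.parts:
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value
    return values


def prune(keys, **wanted):
    """Keep only the keys whose partition folders match every requested value."""
    return [
        key
        for key in keys
        if all(partition_values(key).get(n) == str(v) for n, v in wanted.items())
    ]


# Hypothetical object keys, laid out with the year=/month= convention.
OBJECT_KEYS = [
    "events/year=2018/month=12/part-0000.parquet",
    "events/year=2019/month=01/part-0000.parquet",
    "events/year=2019/month=02/part-0000.parquet",
]

# Only the 2019 files would need to be read for a "year = 2019" filter.
selected = prune(OBJECT_KEYS, year=2019)
```

A query engine that understands the convention can do the same kind of filtering on a `WHERE year = 2019` clause, which is where the speed-up on large datasets comes from.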
Upvotes: 1