user179156

Reputation: 871

Running interactive SQL queries over millions of parquet files

I have millions of streaming parquet files being written. I want to support running ad hoc interactive queries for debugging and analytics purposes (added bonus if I can also run streaming queries for real-time monitoring of key metrics). What is a scalable solution for supporting this?

The two approaches I have seen are: running Spark SQL interactively over millions of parquet files (I'm not too familiar with the Spark ecosystem; does this mean running a Spark job for every SQL query a user submits, or do I need to run some streaming job and submit queries to it somehow?), and running a Presto SQL engine on top of the parquet files (I'm not exactly sure how Presto ingests new incoming parquet files).

Any recommendations or pros and cons of either approach? Are there better solutions, considering I produce > ~10 TB of data every day?

Upvotes: 1

Views: 1398

Answers (1)

Ahmed Kamal

Reputation: 1488

Let me address your use cases:

Support running ad hoc interactive queries for debugging and analytics purposes

I would recommend building a Presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.

Reason: Presto would run fully in-memory without touching disk (in most cases)

A Spark cluster can also do the job; however, it won't be as fast as Presto. The advantage of Spark over Presto is its fault-tolerance capabilities and its ability to spill to disk under out-of-memory conditions, which may be important for you given how much data you have.
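For the ad hoc case, here is a minimal sketch of what interactive Spark SQL looks like in practice: one long-running SparkSession (or the Spark Thrift Server behind a JDBC client) serves many queries, so you are not launching a new job for every SQL statement a user submits. The path and column names (`s3://my-bucket/events/`, `event_type`, `latency_ms`, `dt`) are placeholders, not your actual layout.

```python
# A minimal sketch of ad hoc querying with Spark SQL; paths and columns are
# placeholders. Run this from a notebook, pyspark shell, or Thrift Server.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("adhoc-debugging")
         .getOrCreate())

# Register the parquet directory as a temporary view so plain SQL can be
# submitted interactively against it; each query reuses the same session
# instead of starting a fresh Spark job.
spark.read.parquet("s3://my-bucket/events/").createOrReplaceTempView("events")

spark.sql("""
    SELECT event_type, count(*) AS cnt, avg(latency_ms) AS avg_latency
    FROM events
    WHERE dt = '2020-01-01'        -- dt is an assumed partition column
    GROUP BY event_type
    ORDER BY cnt DESC
""").show()
```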

Run streaming queries for some real-time monitoring of key metrics as well

As long as your queries are basic, you can build dashboards on top of Presto that run these queries every x minutes.

If real-time monitoring is important and your metrics require a considerable amount of processing, that is a good reason to look at Spark Streaming.
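For illustration, a minimal Structured Streaming sketch that watches the same parquet directory and keeps running aggregates per event type. The path, schema, and console sink are assumptions; in practice you would write to something your dashboards can read (e.g. a Kafka topic or a small table).

```python
# A hedged sketch: streaming aggregation over a parquet file source.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-metrics").getOrCreate()

# A streaming file source requires an explicit schema (placeholder columns).
schema = StructType([
    StructField("event_type", StringType()),
    StructField("latency_ms", DoubleType()),
])

stream = (spark.readStream
          .schema(schema)
          .parquet("s3://my-bucket/events/"))

metrics = (stream
           .groupBy("event_type")
           .agg(F.count("*").alias("cnt"),
                F.avg("latency_ms").alias("avg_latency")))

# Emit updated aggregates every minute; console sink is for demonstration only.
query = (metrics.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```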

If real-time isn't required, you could instead build an ETL (using Spark) that calculates your metrics, stores the result as a new Hive table, and exposes it for querying via Presto/Spark SQL again.
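A rough sketch of that ETL route under the same assumed layout: a periodic Spark batch job aggregates the key metrics and saves them as a Hive table, which Presto or Spark SQL can then query cheaply instead of scanning the raw parquet files. The database, table, and column names are illustrative only.

```python
# A hedged ETL sketch: aggregate one day of raw parquet into a small
# Hive-managed metrics table. Names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("metrics-etl")
         .enableHiveSupport()   # required so saveAsTable registers in the Hive Metastore
         .getOrCreate())

events = spark.read.parquet("s3://my-bucket/events/dt=2020-01-01/")

metrics = (events
           .groupBy("event_type")
           .agg(F.count("*").alias("cnt"),
                F.avg("latency_ms").alias("avg_latency")))

# Store the aggregated metrics as a partitioned Hive table; dashboards and
# Presto/Spark SQL then read analytics.key_metrics (assumed database/table)
# instead of the raw files.
(metrics
 .withColumn("dt", F.lit("2020-01-01"))
 .write
 .mode("overwrite")
 .partitionBy("dt")
 .saveAsTable("analytics.key_metrics"))
```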

How does Presto ingest new incoming parquet files?

I'm not aware of your architecture, but in any case you need to provide Presto with a Hive connection (a Hive Metastore, to be precise).

Hive provides Presto with the table schemas attached to the directories where you ingest your data. Presto sees new data in those directories dynamically, by default. Spark is no different, by the way.

Presto has nothing to do with data ingestion. It only starts its job once the data is there.
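To make that concrete, here is a hedged sketch of the usual setup: an external Hive table is declared once over the directory your stream writes to, and Presto (through its Hive connector and the shared Metastore) queries whatever files sit under the registered partitions. The database, table, partition, and path names below are assumptions about your layout, not a prescription.

```python
# A hedged sketch: register an external Hive table over the landing directory
# so Presto and Spark SQL both see it through the Hive Metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-external-table")
         .enableHiveSupport()
         .getOrCreate())

# Declare the schema once; new parquet files landing under an already
# registered partition directory are picked up on the next query.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.events (
        event_type STRING,
        latency_ms DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/'
""")

# New partition directories do need to be registered in the Metastore,
# e.g. after each day's folder appears:
spark.sql("MSCK REPAIR TABLE raw.events")

# From here, Presto queries the same table through its Hive connector:
#   presto> SELECT count(*) FROM hive.raw.events WHERE dt = '2020-01-01';
```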

Upvotes: 1
