user4081921

Design of Spark + Parquet "database"

I've got about 100 GB of text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year) and to incrementally add each day's data, preferably without read locks.

Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?

Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.

Upvotes: 5

Views: 1393

Answers (2)

dalin qin

Reputation: 126

I have a very similar requirement in my system. If you load the whole year's data, at 100 GB per day that is roughly 36 TB, and scanning 36 TB per query can't be fast no matter how you store it. It's better to save the processed daily results somewhere (such as counts, sums, and distinct counts) and use those when you need to go back over the whole year. A sketch of that idea follows.
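A minimal sketch of that pre-aggregation step (Scala / Spark); the landing path and the user_id/amount columns are placeholders for whatever your daily files actually contain:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("daily-rollup").getOrCreate()

    // Read only today's raw drop, not the whole year.
    val daily = spark.read
      .option("header", "true")
      .csv("/data/raw/2016-06-01")          // hypothetical landing path

    // Roll it up once per day: counts, sums, distinct counts per key.
    val rollup = daily
      .groupBy("user_id")                   // hypothetical key column
      .agg(
        count(lit(1)).as("events"),
        sum(col("amount").cast("double")).as("total_amount"),
        countDistinct("amount").as("distinct_amounts")
      )

    // Persist the small daily result; year-long queries then read ~365 of
    // these rollups instead of ~36 TB of raw text.
    rollup.write.mode("append").parquet("/data/rollups/daily")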

Upvotes: 0

Aravind Yarram

Reputation: 80192

My Level 0 design of this

  • Partition by date/time (if your queries filter on date/time, this avoids scanning all of the data)
  • Use the Append SaveMode where required, so each day's batch adds new files instead of rewriting the dataset
  • Run the Spark SQL distributed SQL engine (Thrift server) so that
    1. the data can be queried from multiple clients/applications/users
    2. the data is cached only once across all clients/applications/users
  • Use plain HDFS, if you can, to store all your Parquet files (see the sketch after this list)
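
A minimal sketch of that write/read path (Scala / Spark); the HDFS location, the event_date column, and the landing path are assumptions, not part of the answer above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("daily-ingest").getOrCreate()

    // Parse today's ~100 GB text drop (schema/parsing details omitted).
    val today = spark.read
      .option("header", "true")
      .csv("/landing/2016-06-01")           // hypothetical landing path

    // Parquet on HDFS, partitioned by date, appended each day.
    today.write
      .mode("append")                       // SaveMode.Append: adds new files only
      .partitionBy("event_date")            // one directory per day
      .parquet("hdfs:///warehouse/events")

    // Queries that filter on the partition column only read the matching
    // directories (partition pruning) instead of scanning the whole year.
    spark.read.parquet("hdfs:///warehouse/events")
      .where("event_date >= '2016-05-01'")
      .groupBy("event_date")
      .count()
      .show()

For the shared SQL engine, Spark ships a JDBC/ODBC Thrift server (started with sbin/start-thriftserver.sh) that runs a single Spark application, so multiple clients can query the same tables and share one cache.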

Upvotes: 3
