akoeltringer

Reputation: 1721

Add data to Spark/Parquet Data stored on Disk

I am in a situation similar to the one mentioned here, but that question is not answered satisfactorily. Besides, I have less data to handle (around 1 GB a day).

My situation: I have a certain amount of data (~500 GB) already available as Parquet (the "storage format" that was agreed on), and I get periodic incremental updates. I want to be able to handle the ETL part as well as the analytics part afterwards.

In order to also be able to efficiently produce updates on certain "intermediate data products", I see three options, the last of which is to write each incremental update into its own (timestamped) subfolder instead of appending to the existing Parquet files.

I am tending towards the last approach, also because there are comments (evidence?) that append mode leads to problems as the number of partitions grows (see for example this SO question).

So my question: is there some serious drawback to creating a dataset structure like this? Obviously, Spark needs to do "some" merging/sorting when reading across multiple folders, but besides that?

I am using Spark 2.1.0.
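For reference, a minimal sketch of that last approach as I understand it (and as the answers below interpret it): each incremental update goes into its own dated subfolder, and Spark's partition discovery merges them when reading the parent directory. All paths and the "day" folder name here are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("incremental-parquet")
  .getOrCreate()

// Hypothetical locations: new data is staged daily, and the agreed-on
// Parquet "master" dataset lives under /data/master.
val update = spark.read.parquet("/staging/2017-06-01")

// Write the update into its own subfolder instead of appending to the
// existing files. Using a key=value folder name makes Spark treat
// "day" as a partition column during discovery.
update.write.parquet("/data/master/day=2017-06-01")

// Reading the parent directory picks up all subfolders as one
// DataFrame; "day" shows up as an extra column.
val all = spark.read.parquet("/data/master")
all.where(all("day") === "2017-06-01").count()
```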

Upvotes: 2

Views: 624

Answers (2)

Vishnu Prathish

Reputation: 369

I've noticed that the larger the number of folders inside the directory, the longer it takes for spark.read to execute, because Spark samples the data/metadata to figure out the schema. But that may be an inevitability you have to deal with.

If you add an upload-timestamp, or even better an upload-date-hour, and partition by it, Spark will naturally write to that folder. If there is a chance that multiple sets of files could arrive in a given hour, make sure that the write goes through an API that performs a union with the existing data before it is written down.
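A minimal sketch of that suggestion, assuming an existing SparkSession `spark` and an incoming DataFrame `events` (both hypothetical, as is the target path); `partitionBy` routes each row into a folder named after the column value:

```scala
import org.apache.spark.sql.functions.{current_timestamp, date_format}

// Stamp the batch with an upload date-hour (the format is illustrative).
val stamped = events.withColumn(
  "upload_date_hour",
  date_format(current_timestamp(), "yyyy-MM-dd-HH"))

// Each row lands in a folder such as
// /data/master/upload_date_hour=2017-06-01-13/
// If several file sets can arrive within the same hour, read that
// hour's existing folder and union it with the batch before this write.
stamped.write
  .mode("append")
  .partitionBy("upload_date_hour")
  .parquet("/data/master")
```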

Upvotes: 0

Vidya

Reputation: 30300

Nathan Marz, formerly of Twitter and author of the Lambda Architecture, describes the process of vertical partitioning for storing data in the Batch Layer, which is the source of truth and contains all data the architecture has ever seen. This master dataset is immutable and append-only. Vertical partitioning is just a fancy name for sorting the data into separate folders.

That is exactly what you describe in your third option.

Doing this makes for significant performance gains because functions performed on the master dataset will only access those data relevant to the computation. This makes batch queries and the creation of indexes in the Serving Layer much faster. The names of the folders are up to you, but typically there is a timestamp component involved.
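To make that concrete, here is a sketch of how such a layout pays off at read time (the paths and the "day" partition column are hypothetical; `spark` is an existing SparkSession):

```scala
import org.apache.spark.sql.functions.col

// Read a single folder; the basePath option keeps the partition
// column ("day") in the resulting schema.
val oneDay = spark.read
  .option("basePath", "/data/master")
  .parquet("/data/master/day=2017-06-01")

// Or read the whole master dataset and filter on the partition
// column: Spark prunes the non-matching folders instead of
// scanning them.
val pruned = spark.read.parquet("/data/master")
  .where(col("day") === "2017-06-01")
```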

Whether you're building a Lambda Architecture or not, vertical partitioning will make your analytics a lot more efficient.

Upvotes: 1
