akoeltringer

Reputation: 1721

Add data to Spark/Parquet Data stored on Disk

I am in a situation similar to the one mentioned here, but that question is not answered satisfactorily. Besides, I have less data to handle (around 1 GB a day).

My situation: I have a certain amount of data (~500 GB) already available as Parquet (the "storage format" that was agreed on), and I get periodic incremental updates. I want to be able to handle the ETL part as well as the analytics part afterwards.

In order to also be able to efficiently produce updates on certain "intermediate data products", I see three options, the last of which is to write each incremental update into its own (timestamped) subfolder instead of appending to the existing Parquet files.

I am tending towards the last approach, also because there are comments (evidence?) that append mode leads to problems as the number of partitions grows (see for example this SO question).

So my question: is there some serious drawback to creating a dataset structure like this? Obviously, Spark needs to do "some" merging/sorting when reading across multiple folders, but besides that?

I am using Spark 2.1.0.
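For reference, a minimal sketch of that last approach as I understand it (and as the answers below interpret it): each incremental update goes into its own dated subfolder, and Spark's partition discovery merges them when reading the parent directory. All paths and the "day" folder name here are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("incremental-parquet")
  .getOrCreate()

// Hypothetical locations: new data is staged daily, and the agreed-on
// Parquet "master" dataset lives under /data/master.
val update = spark.read.parquet("/staging/2017-06-01")

// Write the update into its own subfolder instead of appending to the
// existing files. Using a key=value folder name makes Spark treat
// "day" as a partition column during discovery.
update.write.parquet("/data/master/day=2017-06-01")

// Reading the parent directory picks up all subfolders as one
// DataFrame; "day" shows up as an extra column.
val all = spark.read.parquet("/data/master")
all.where(all("day") === "2017-06-01").count()
```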

Upvotes: 2

Views: 624

Answers (2)

Vishnu Prathish

Reputation: 369

I've noticed that the larger the number of folders inside the directory, the longer it takes for spark.read to execute, because Spark samples the data/metadata to figure out the schema. But that may be an inevitability you have to deal with.

If you add an upload-timestamp, or even better an upload-date-hour, and partition by it, Spark will naturally write to that folder. If there is a chance that multiple sets of files could arrive in a given hour, make sure that the write goes through an API that performs a union with the existing data before it is written down.
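A minimal sketch of that suggestion, assuming an existing SparkSession `spark` and an incoming DataFrame `events` (both hypothetical, as is the target path); `partitionBy` routes each row into a folder named after the column value:

```scala
import org.apache.spark.sql.functions.{current_timestamp, date_format}

// Stamp the batch with an upload date-hour (the format is illustrative).
val stamped = events.withColumn(
  "upload_date_hour",
  date_format(current_timestamp(), "yyyy-MM-dd-HH"))

// Each row lands in a folder such as
// /data/master/upload_date_hour=2017-06-01-13/
// If several file sets can arrive within the same hour, read that
// hour's existing folder and union it with the batch before this write.
stamped.write
  .mode("append")
  .partitionBy("upload_date_hour")
  .parquet("/data/master")
```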

Upvotes: 0

Vidya

Reputation: 30300

Nathan Marz, formerly of Twitter and author of the Lambda Architecture, describes the process of vertical partitioning for storing data in the Batch Layer, which is the source of truth and contains all data the architecture has ever seen. This master dataset is immutable and append-only. Vertical partitioning is just a fancy name for sorting the data into separate folders.

That is exactly what you describe in your third option.

Doing this makes for significant performance gains because functions performed on the master dataset will only access those data relevant to the computation. This makes batch queries and the creation of indexes in the Serving Layer much faster. The names of the folders are up to you, but typically there is a timestamp component involved.
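To make that concrete, here is a sketch of how such a layout pays off at read time (the paths and the "day" partition column are hypothetical; `spark` is an existing SparkSession):

```scala
import org.apache.spark.sql.functions.col

// Read a single folder; the basePath option keeps the partition
// column ("day") in the resulting schema.
val oneDay = spark.read
  .option("basePath", "/data/master")
  .parquet("/data/master/day=2017-06-01")

// Or read the whole master dataset and filter on the partition
// column: Spark prunes the non-matching folders instead of
// scanning them.
val pruned = spark.read.parquet("/data/master")
  .where(col("day") === "2017-06-01")
```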

Whether you're building a Lambda Architecture or not, vertical partitioning will make your analytics a lot more efficient.

Upvotes: 1
