Shubham Saxena

Reputation: 65

Is it better to partition by timestamp or by year, month, day, hour?

I am working on a Spark application that writes processed data to Parquet files, and queries on the data are always for a time period. I would therefore like to partition it by date-time. Which of the two approaches is better?

DataFrame: |CreationTime(java.sql.Timestamp)| Data(String)|

1) dataframe.write.partitionBy("CreationTime").parquet(path)

2) dataframe.withColumn("year", year(col("CreationTime")))
            .withColumn("month", month(col("CreationTime")))
            .withColumn("day", dayofmonth(col("CreationTime")))
            .write.partitionBy("year", "month", "day").parquet(path)

In the second case, the reader gets more complex: it has to apply filters like col("year") >= startTime.year and col("year") <= endTime.year, and similarly for month and day.
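For illustration, a rough sketch of what that reader could look like (start and end here are made-up java.time.LocalDateTime query bounds, not real code from my application): a coarse filter on the partition columns plus an exact filter on CreationTime.

    import java.sql.Timestamp
    import org.apache.spark.sql.functions.col

    // start / end are hypothetical query bounds (java.time.LocalDateTime)
    val filtered = spark.read.parquet(path)
      .filter(col("year").between(start.getYear, end.getYear))   // prunes whole year=... directories
      .filter(col("CreationTime").between(                       // exact time-range bounds
        Timestamp.valueOf(start), Timestamp.valueOf(end)))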

What is the common practice and recommendation for partitioning the data in such cases?

Upvotes: 3

Views: 7047

Answers (1)

GPI

Reputation: 9338

Aiming for 1 GB every 10 minutes means you'll very quickly build up a pretty massive amount of data (1,000 files and 1 TB a week, give or take).

Your choices have to take into account at least:

  • The queries you'll have to support (will they almost always have a time range, or never?)
  • The storage you'll dump it to (will you have to split across multiple datacenters / S3 buckets / ...?)
  • The size that your partitioning scheme will add on top of your data

As an example, creating a year column, a month column, a day column, an hour column, and a minutes column means adding 5 columns, each of a numeric type, to every one of your records. Maybe that's negligible, maybe not at all. And you'll have to pay for this increase in storage at write time, plus the associated bandwidth and CPU to parse it at read time, for data that will not help build any functionality.

On the other hand, it would be a very readable / debuggable way of storing this data.

You can have another, simpler strategy, which would be to have a single number representing each 10-minute frame: yourTimestampInMillisSinceEpoch / TimeUnit.MILLISECONDS.convert(10, TimeUnit.MINUTES). That would only cost you one number.
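As an illustrative sketch (the column name timeFrame is made up here), that number can be derived directly in Spark: casting the timestamp to long gives seconds since the epoch, so integer-dividing by 600 yields the 10-minute frame.

    import org.apache.spark.sql.functions.{col, floor}

    // CreationTime cast to long = seconds since the epoch; 600 s = one 10-minute frame
    val withFrame = dataframe
      .withColumn("timeFrame", floor(col("CreationTime").cast("long") / 600))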

I would recommend this "single column" strategy, but I would not stop there.

Before actually writing the files to a single path, I would split the dataframe by day and by month. Not that I would actually need to store the day and month (I'd drop them before writing the result), but I would use them to build my Parquet folder path, something along the lines of (hdfs or S3, etc.)://your_root/{year}/{month}/{day}/. (A particularly efficient way of doing this is to leverage partition discovery at the Spark level, which allows Spark to use the directory path to "infer" "virtual" columns from it, and to optimize the querying accordingly; see the sketch below.)
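A minimal sketch of the partition-discovery flavour of that layout (the path and column names are illustrative, not a prescription): letting partitionBy build the year=/month=/day= directories means Spark re-infers those columns from the path at read time.

    import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

    dataframe
      .withColumn("year",  year(col("CreationTime")))
      .withColumn("month", month(col("CreationTime")))
      .withColumn("day",   dayofmonth(col("CreationTime")))
      .write
      .partitionBy("year", "month", "day")       // these columns become directories, not file data
      .parquet("hdfs://your_root/")              // .../year=2018/month=3/day=14/part-...parquet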

That would allow me, if I ever need to query inside a time range, to leverage HDFS globs, so as to not even start reading files that have no chance of holding any interesting data for that given time range.
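For example (paths are illustrative), a read restricted to the first three months of a year via a glob, keeping the inferred partition columns thanks to basePath:

    val q1 = spark.read
      .option("basePath", "hdfs://your_root/")
      .parquet("hdfs://your_root/year=2018/month={1,2,3}/*")   // files outside the range are never opened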

Dumping all the data at a single path would prevent this optimization: you'd need to list a very large directory, then open each of its files to see what time window it holds, which would be a huge waste of CPU, bandwidth and performance (and, yes, down the road, money).

Upvotes: 3
