Reputation: 563
I couldn't really get a straight answer from the net. Consider the following scenario: I have data containing a user_id and timestamps of user activity:
val bigData = Seq(
  ("id3", 12),
  ("id1", 55),
  ("id1", 59),
  ("id1", 50),
  ("id2", 51),
  ("id3", 52),
  ("id2", 53),
  ("id1", 54),
  ("id2", 34)
).toDF("user_id", "ts")
So the original DataFrame looks like this:
+-------+---+
|user_id| ts|
+-------+---+
| id3| 12|
| id1| 55|
| id1| 59|
| id1| 50|
| id2| 51|
| id3| 52|
| id2| 53|
| id1| 54|
| id2| 34|
+-------+---+
and this is what I will write to HDFS/S3, for example.
However, I can also save the data grouped by user, for instance like this:
bigData.groupBy("user_id").agg(collect_list("ts") as "ts")
Which results in:
+-------+----------------+
|user_id| ts|
+-------+----------------+
| id3| [12, 52]|
| id1|[55, 59, 50, 54]|
| id2| [51, 53, 34]|
+-------+----------------+
I can't get a decisive answer on which method gets better storage/compression on the filesystem. Intuitively, the grouped approach looks better storage/compression-wise.
Does anyone know if one approach is definitively better, or know of any benchmarks or articles on this subject?
Upvotes: 2
Views: 532
Reputation: 6974
Let's consider the first case, where the data is stored in a flat structure. If you sort the data with respect to the id, then rows with the same id will go to the same partition. This allows Parquet's dictionary encoding to kick in, which will bring down the size.
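As a rough sketch of the sorted write (the output path is a placeholder, and a SparkSession with the example `bigData` DataFrame is assumed):

```scala
// Sort by user_id so equal ids land next to each other within each
// partition, giving Parquet's dictionary/RLE encodings longer runs to
// work with. The output path below is a placeholder.
bigData
  .sort("user_id")
  .write
  .parquet("/tmp/flat_sorted")
```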
Moreover, if your ts values are bounded, the Parquet format can keep a base value and store only offsets from it (delta encoding).
For example, if the ts values are
50 51 52 60
Parquet saves: base: 50, offsets: 0, 1, 2, 8
This can save significant space if the offsets can be represented in 2 bytes.
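To make the idea concrete, here is a plain-Scala illustration of the base-plus-offsets scheme (this shows the principle, not Parquet's exact on-disk bit layout):

```scala
// Delta-encoding idea: store the minimum as a base, and each value as
// an offset from that base. Small offsets need far fewer bits than the
// raw values.
val ts = Seq(50, 51, 52, 60)
val base = ts.min                // 50
val offsets = ts.map(_ - base)   // List(0, 1, 2, 8)
```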
The other (grouped) format is also valid. But since Parquet is a columnar format, the bigger a column value gets, the more padding Parquet will create for the rest of the column's values.
For example, with

ts
----
[20]
[20,40,60,70,80]

Parquet will create padding for [20] and keep it the same size as [20,40,60,70,80].
I would recommend you run various experiments on the dataset, measure the sizes, and check the Parquet footers. You will get great insight into how Parquet stores the data for your application. The thing is, the resulting size depends on the underlying data, so we may not get a conclusive general answer.
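One minimal way to run such an experiment, assuming a SparkSession with the example `bigData` DataFrame and using placeholder local paths:

```scala
// Write the same data in both layouts, then compare on-disk sizes and
// inspect the footers. Paths below are placeholders.
bigData
  .sort("user_id")
  .write.parquet("/tmp/flat")

bigData
  .groupBy("user_id").agg(collect_list("ts") as "ts")
  .write.parquet("/tmp/grouped")

// Afterwards, from a shell:
//   du -sh /tmp/flat /tmp/grouped      -- compare total sizes
//   parquet-tools meta <part-file>     -- see encodings per column chunk
```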
Upvotes: 1