Powers

Reputation: 19338

Spark performance enhancements by storing sorted Parquet files

Will data extracts run quicker if a DataFrame is sorted before being persisted as Parquet files?

Suppose we have the following peopleDf DataFrame (pretend this is a sample and the real one has 20 billion rows):

+-----+----------------+
| age | favorite_color |
+-----+----------------+
|  54 | blue           |
|  10 | black          |
|  13 | blue           |
|  19 | red            |
|  89 | blue           |
+-----+----------------+

Let's write out sorted and unsorted versions of this DataFrame to Parquet files.

peopleDf.write.parquet("s3a://some-bucket/unsorted/")
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")

Are there any performance gains when reading in the sorted data and doing a data extract based on favorite_color?

val pBlue1 = spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue")

// is this faster?

val pBlue2 = spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue")

Upvotes: 8

Views: 6099

Answers (1)

user6022341


Sorting provides a number of benefits:

  • more efficient filtering using file metadata: Parquet stores min/max statistics for each row group, and sorting clusters equal values together, so a filter on the sort column can skip whole row groups (see the sketch below).
  • a better compression ratio: run-length and dictionary encoding work best on runs of identical or similar values.
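
To see the metadata-based filtering at work, you can inspect the physical plan. A minimal sketch (the bucket paths are the hypothetical ones from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-parquet-demo").getOrCreate()
import spark.implicits._

val peopleDf = Seq(
  (54, "blue"), (10, "black"), (13, "blue"), (19, "red"), (89, "blue")
).toDF("age", "favorite_color")

// Sorting before the write clusters equal values, so each Parquet row group
// gets a tight min/max range for favorite_color in its footer statistics.
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")

// The plan should show the predicate under "PushedFilters"; row groups whose
// [min, max] range excludes "blue" can be skipped without decoding any data.
spark.read
  .parquet("s3a://some-bucket/sorted/")
  .filter($"favorite_color" === "blue")
  .explain()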

If you want to filter on a single column, partitioning the data on that column can be more efficient and doesn't require a shuffle, although there are some related open issues at the moment.
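
As a rough sketch of that alternative (reusing spark and peopleDf from the snippet above; the /partitioned/ path is made up):

// partitionBy writes one directory per distinct value, e.g.
// s3a://some-bucket/partitioned/favorite_color=blue/, and a filter on the
// partition column then reads only the matching directory (partition pruning).
peopleDf.write.partitionBy("favorite_color").parquet("s3a://some-bucket/partitioned/")

val pBlue3 = spark.read
  .parquet("s3a://some-bucket/partitioned/")
  .filter($"favorite_color" === "blue")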

Upvotes: 3
