The Singularity

Reputation: 2698

Why do Parquet files generate multiple parts in Pyspark?

After some extensive research I have figured that

Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

However, I am unable to understand why Parquet writes multiple files when I run df.write.parquet("/tmp/output/my_parquet.parquet"), even though it supports flexible compression options and efficient encoding. Is this directly related to parallel processing or similar concepts?

Upvotes: 1

Views: 2632

Answers (2)

Michael Delgado

Reputation: 15452

Lots of frameworks make use of this multi-file layout feature of the parquet format. So I'd say it's a standard convention in the parquet ecosystem, and Spark uses it by default.

This does have benefits for parallel processing, but also for other use cases, such as processing (in parallel or in series) on cloud or networked file systems, where data transfer times may be a significant portion of total IO. In these cases, the parquet "hive" format, which uses small metadata files providing statistics and information about which data files to read, offers significant performance benefits when reading small subsets of the data. This is true whether a single-threaded application is reading a subset of the data or each worker in a parallel job is reading a portion of the whole.
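As a rough illustration, here is a minimal PySpark sketch (the dataset, the column names, and the output path are hypothetical) of how a partitioned, multi-file layout lets a reader touch only a subset of the files:

# Minimal sketch: a hive-style partitioned write, then a filtered read that
# only needs the part files under the matching partition directory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-subset-demo").getOrCreate()

df = spark.createDataFrame(
    [(2021, "a", 1.0), (2022, "b", 2.0), (2022, "c", 3.0)],
    ["year", "key", "value"],
)

# One subdirectory per distinct 'year', each holding its own part files.
df.write.mode("overwrite").partitionBy("year").parquet("/tmp/output/events")

# A filter on the partition column lets Spark skip whole directories,
# so only the matching part files are read.
subset = spark.read.parquet("/tmp/output/events").filter("year = 2022")
subset.show()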

Upvotes: 2

Anjaneya Tripathi

Reputation: 1459

It's not just parquet; it's a Spark feature. To avoid network IO, Spark writes each shuffle partition as a 'part...' file on disk, and each of those files, as you said, gets compression and efficient encoding by default.

So yes, it is directly related to parallel processing.
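A minimal sketch of this (paths and sizes are just examples): the number of part files written matches the number of partitions the DataFrame has at write time, and coalescing before writing reduces the file count at the cost of write parallelism.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-parts-demo").getOrCreate()

df = spark.range(1_000_000)

# Number of partitions at write time, e.g. 8 -> expect roughly 8 part-*.parquet files.
print(df.rdd.getNumPartitions())

df.write.mode("overwrite").parquet("/tmp/output/many_parts.parquet")

# Coalescing to a single partition produces one part file,
# but the write is no longer done in parallel.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/output/single_part.parquet")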

Upvotes: 2
