Reputation: 629
I saved my dataframe as parquet format
df.write.parquet('/my/path')
When checking on HDFS, I can see that there is 10 part-xxx.snappy.parquet files under the parquet directory /my/path
My question is: is one part-xxx.snappy.parquet file correspond to a partition of my dataframe ?
Upvotes: 1
Views: 1998
Reputation: 5093
Yes, this creates one file per Spark-partition.
Note, that you can also partition files by some attribute:
df.write.partitionBy("key").parquet("/my/path")
in such case Spark is going to create up to Spark-partition number of files for each parquet-partition. Common way to reduce number of files in such case is to repartition data by key before writing (this effectively creates one file per partition).
Upvotes: 1
Reputation: 31550
Yes, part-** files are created based on number of partitions
in the dataframe while writing to HDFS.
To check number of partitions
in the dataframe:
df.rdd.getNumPartitions()
To control number of files writing to filesystem we can use .repartition (or) .coalesce() (or) dynamically
based on our requirement.
Upvotes: 2