Reputation: 869
I've finally been introduced to Parquet and am trying to understand it better. I realize that when running Spark it is best to have at least as many Parquet files (partitions) as you have cores to utilize Spark to its fullest. However, are there any advantages/disadvantages to making one large Parquet file vs. several smaller Parquet files to store the data?
As a test I'm using this dataset:
https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.parquet
This is the code I'm testing with:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')

# one write to a path ending in .parquet, one to a plain directory name
df.write.parquet('test.parquet')
df.write.parquet('./test')
When I ls -lh the files I see that the test.parquet file is 4.0K, and the two files created by writing to a directory are 2.5K and 189M.
When I read these back into different dataframes they have the same count.
When is it best practice to do one over the other? How should the file sizes be balanced when writing to a directory, and should they be balanced at all? Any guidance/rules of thumb for writing/reading Parquet files would be greatly appreciated.
Upvotes: 0
Views: 1822
Reputation: 1459
In Spark you can use repartition to break the data into nearly equal chunks. As suggested in Databricks training, you can pick the number of cores and use that number to repartition your data, since the default shuffle partition count is set to 200, which is a bit high unless a lot of data is present.
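For example, a minimal sketch of that approach (this assumes the spark session and df from the question; num_cores and the output path are just placeholders):

# number of cores available to the local[*] session
num_cores = spark.sparkContext.defaultParallelism

# repartition into roughly equal chunks, one per core, before writing
df.repartition(num_cores) \
    .write \
    .mode("overwrite") \
    .parquet("./test_repartitioned")  # placeholder output directory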
One specific gotcha with repartition is when your DataFrame has complex data types whose values vary widely in size; for that case you can refer to this question on Stack Overflow.
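If row sizes vary a lot and you also want to keep any single output file from growing too large, one knob to try is the maxRecordsPerFile writer option. A sketch only: the 1000000 threshold is an arbitrary placeholder, and note it caps rows per file, not bytes.

# cap the number of rows written to each output file
df.write \
    .option("maxRecordsPerFile", 1000000) \
    .mode("overwrite") \
    .parquet("./test_capped")  # placeholder output directory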
Upvotes: 1