Michael

Reputation: 869

Difference between <Spark Dataframe>.write.parquet(<directory>) and <Spark Dataframe>.write.parquet(<file name>.parquet)

I've finally been introduced to Parquet and am trying to understand it better. I realize that when running Spark it is best to have at least as many Parquet files (partitions) as cores in order to use Spark to its fullest. However, are there any advantages or disadvantages to storing the data as one large Parquet file versus several smaller ones?

As a test I'm using this dataset:
https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.parquet

This is the code I'm testing with:

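import pyspark
from pyspark.sql import SparkSession

# Local session that uses every available core
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

# Read the downloaded trip data
df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')

# Write once to a path ending in .parquet and once to a plain directory path
df.write.parquet('test.parquet')
df.write.parquet('./test')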

When I ls -lh the files, I see that the test.parquet file is 4.0K,

and the two files created by writing to a directory are 2.5K and 189M.

When I read these back into different dataframes they have the same count.


When is it best practice to do one over the other? Is there a best practice for balancing the file sizes when writing to a directory, and should you balance them at all? Any guidance or rules of thumb for writing and reading Parquet files would be greatly appreciated.

Upvotes: 0

Views: 1822

Answers (1)

Anjaneya Tripathi

Reputation: 1459

In Spark you can use repartition to split the data into nearly equal chunks before writing. As suggested in the Databricks training, you can take the number of cores and use that as the partition count, because the default number of shuffle partitions is 200, which is a bit high unless a lot of data is present.
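As a rough illustration of the repartition-then-write idea, here is a minimal sketch. The helper names `write_evenly` and `part_file_sizes` are made up for this example, and overwriting the output directory is an assumption you may not want in practice.

```python
import os


def write_evenly(df, out_dir, num_partitions):
    """Shuffle df into num_partitions roughly equal partitions and write them.

    Spark writes at most one Parquet part file per partition into out_dir.
    """
    (df.repartition(num_partitions)
       .write
       .mode("overwrite")  # assumption: replacing an existing out_dir is acceptable
       .parquet(out_dir))


def part_file_sizes(out_dir):
    """Return {file name: size in bytes} for the Parquet part files in out_dir.

    Marker files such as _SUCCESS and hidden .crc checksum files are skipped.
    """
    return {
        name: os.path.getsize(os.path.join(out_dir, name))
        for name in sorted(os.listdir(out_dir))
        if name.startswith("part-") and name.endswith(".parquet")
    }
```

With the session from the question, something like `write_evenly(df, './test', spark.sparkContext.defaultParallelism)` followed by `part_file_sizes('./test')` should show part files of similar size. Under `local[*]`, `defaultParallelism` normally matches the number of cores, but that is worth checking on your own setup.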

One specific gotcha with repartition arises when your dataframe has complex data types whose values vary widely in size; for that case, refer to this question on Stack Overflow.

Upvotes: 1
