SarahData

Reputation: 809

Repartitioning a pyspark dataframe fails and how to avoid the initial partition size

I'm trying to tune the performance of Spark by partitioning a Spark dataframe. Here is the code:

from pyspark.sql import functions as func

file_path1 = spark.read.parquet(*paths[:15])
df = file_path1.select(columns) \
    .where(func.col("organization") == organization)
df = df.repartition(10)
# execute an action just to make Spark execute the repartition step
df.first()

During the execution of first() I check the job stages in the Spark UI, and here is what I find: [screenshots: Job details, Stage 7 details]

Note: initially the DataFrame is read from a selection of Parquet files in Hadoop.

I have already read this question as a reference: How does Spark partition(ing) work on files in HDFS?
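For reference, a small sketch of how the partition count can be inspected around the repartition step, using the same df as above:

print(df.rdd.getNumPartitions())  # partition count inherited from the Parquet files
df = df.repartition(10)
print(df.rdd.getNumPartitions())  # 10 after the shuffle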

Upvotes: 0

Views: 2165

Answers (1)

firsni

Reputation: 916

  • Whenever there is shuffling, there is a new stage, and repartition causes shuffling; that's why you have two stages.
  • Caching is used when you will use the dataframe multiple times, so that it is not read twice (see the sketch after this list).
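For example, a minimal sketch of the caching point, using the dataframe from the question:

df = df.repartition(10).cache()  # keep the shuffled result in memory

df.count()   # first action: reads the Parquet files and performs the shuffle
df.first()   # later actions reuse the cached partitions instead of re-reading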

Use coalesce instead of repartition. I think it causes less shuffling since it only reduces the number of partitions.
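A minimal sketch of that suggestion, assuming the dataframe starts with more than 10 partitions:

df = df.coalesce(10)  # merges existing partitions, no full shuffle
df.first()            # the read and the coalesce stay in a single stage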

Upvotes: 0
