Madhu P

Reputation: 1

Handling compressed files in Spark: can repartitioning improve or deteriorate performance?

I start my Spark shell with the "start_pyspark_shell" command, passing CLI options for 4 executors, 2 cores per executor, 4 GB of memory for the worker nodes, and 4 GB for the master.

Storage: HDFS

Input file: a compressed .csv.gz file of size 221.3 MB (2 blocks on HDFS)
Spark version: 2.4.0

The task at hand is a simple one: count the number of records in the file. The only catch is that it's a compressed file. I loaded the file using

df = spark.read.format("com.databricks.spark.csv").load(hdfs_path)
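In Spark 2.x the same load can also be written with the built-in CSV reader; a minimal equivalent sketch (assuming hdfs_path points at the .csv.gz file, which Spark decompresses transparently):

# Built-in CSV data source, equivalent to the com.databricks.spark.csv alias in Spark 2.x
df = spark.read.csv(hdfs_path)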

When I ran df.count(), I saw a single executor task. That is probably expected, since gzip is not a splittable format, so the file is read as a single partition and operated on by one task.

I checked the number of partitions with df.rdd.getNumPartitions() and it returned 1, as expected.

The processing time was around 15-17 seconds across multiple runs of the same command.

I think we can conclude that there was not much parallelism in the above processing.

I then tried df.repartition(10).count(), expecting the data to be repartitioned into 10 new partitions, probably spread across the worker nodes. The number of tasks did match the number of partitions I specified, so I was hoping for improved execution time. Instead it took 25-26 seconds.
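A minimal sketch of the timing experiment (hdfs_path is a placeholder for the file location; the wall-clock timing is just for illustration):

import time

df = spark.read.csv(hdfs_path)

start = time.time()
df.count()                     # one task: the gzip file is a single partition
print("plain count: %.1fs" % (time.time() - start))

start = time.time()
df.repartition(10).count()     # full read plus a shuffle before counting
print("repartition(10) count: %.1fs" % (time.time() - start))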

When I used .repartition(20), it ran for more than 4 minutes and I had to kill it.

Performance got worse. Did I do something wrong, or did I miss a step needed to get improved performance?

Note: I did see some good existing posts on this but still did not get clarity, hence this new question.

Upvotes: 0

Views: 457

Answers (1)

Madhu P

Reputation: 1

A compressed file is loaded into a single partition on a single executor. Repartitioning does give us more tasks running in parallel on different worker nodes, but it also incurs the extra cost of shuffling / copying the data to those nodes.

That shuffle seems to be the reason for the longer processing time.

Conclusion:

a) If the task / action is simple, it is not worth repartitioning the data of a compressed file.

b) If there is a lot of processing further down the line, the cost of repartitioning is paid only once, while multiple downstream actions can benefit from it; in that case the extra time is worth it (see the sketch below).
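A minimal sketch of case (b), assuming the DataFrame is reused across several actions (hdfs_path and the filter condition are placeholders):

# Pay the read + shuffle cost once, then reuse the partitioned, cached data.
df = spark.read.csv(hdfs_path).repartition(10).cache()

df.count()                                  # first action: read, shuffle, cache
df.filter(df["_c0"].isNotNull()).count()    # later actions run in parallel on the cached partitions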

Upvotes: 0
