Madhu P

Reputation: 1

Handling compressed files in Spark: can repartitioning improve or deteriorate performance?

I start my Spark shell with the "start_pyspark_shell" command, passing CLI options for 4 executors, 2 cores per executor, 4 GB of memory for the worker nodes, and 4 GB for the master.

Storage: HDFS

Input file: a compressed .csv.gz file of size 221.3 MB (2 blocks on HDFS)
Spark version: 2.4.0

The task at hand is a simple one: count the number of records in the file. The only catch is that it's a compressed file. I loaded the file using

df = spark.read.format("com.databricks.spark.csv").load(hdfs_path)
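In Spark 2.x the same load can also be written with the built-in CSV reader; a minimal equivalent sketch (assuming hdfs_path points at the .csv.gz file, which Spark decompresses transparently):

# Built-in CSV data source, equivalent to the com.databricks.spark.csv alias in Spark 2.x
df = spark.read.csv(hdfs_path)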

When I ran df.count(), I saw a single executor task. That is probably expected, since gzip is not a splittable format, so the file is read as a single partition and operated on by one task.

I checked the number of partitions with df.rdd.getNumPartitions() and it returned 1, as expected.

The processing time was around 15-17 seconds across multiple runs of the same command.

I think we can conclude that there was not much parallelism in the above processing.

I then tried df.repartition(10).count(), expecting the data to be repartitioned into 10 new partitions, probably spread across the worker nodes. The number of tasks did match the number of partitions I specified, so I was hoping for improved execution time. Instead it took 25-26 seconds.
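A minimal sketch of the timing experiment (hdfs_path is a placeholder for the file location; the wall-clock timing is just for illustration):

import time

df = spark.read.csv(hdfs_path)

start = time.time()
df.count()                     # one task: the gzip file is a single partition
print("plain count: %.1fs" % (time.time() - start))

start = time.time()
df.repartition(10).count()     # full read plus a shuffle before counting
print("repartition(10) count: %.1fs" % (time.time() - start))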

When I used .repartition(20), it ran for more than 4 minutes and I had to kill it.

Performance got worse. Did I do something wrong, or did I miss a step needed to get improved performance?

Note: I did see some good existing posts on this but still did not get clarity, hence this new question.

Upvotes: 0

Views: 457

Answers (1)

Madhu P

Reputation: 1

A compressed file is loaded into a single partition on a single executor. Repartitioning does give us more tasks running in parallel on different worker nodes, but it also incurs the extra cost of shuffling / copying the data to those nodes.

That shuffle seems to be the reason for the longer processing time.

Conclusion:

a) If the task / action is simple, it is not worth repartitioning the data of a compressed file.

b) If there is a lot of processing further down the line, the cost of repartitioning is paid only once, while multiple downstream actions can benefit from it; in that case the extra time is worth it (see the sketch below).
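A minimal sketch of case (b), assuming the DataFrame is reused across several actions (hdfs_path and the filter condition are placeholders):

# Pay the read + shuffle cost once, then reuse the partitioned, cached data.
df = spark.read.csv(hdfs_path).repartition(10).cache()

df.count()                                  # first action: read, shuffle, cache
df.filter(df["_c0"].isNotNull()).count()    # later actions run in parallel on the cached partitions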

Upvotes: 0
