Reputation: 1
I am starting my PySpark shell using the "start_pyspark_shell" command with the following CLI options: 4 executors, 2 cores per executor, 4 GB of memory for the worker nodes, and 4 GB for the master.
Storage: HDFS
Input file: a compressed .csv.gz file of size 221.3 MB (2 blocks on HDFS)
Spark version: 2.4.0
The task at hand is a simple one: count the number of records in the file. The only catch is that it's a compressed file. I loaded the file using
df = spark.read.format("com.databricks.spark.csv").load(hdfs_path)
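(As an aside, on Spark 2.x the built-in CSV source should load it the same way; this is just an equivalent form of the call above, assuming the same hdfs_path:)

# Equivalent load using the built-in CSV source in Spark 2.x;
# the gzipped file still arrives as a single partition either way.
df = spark.read.format("csv").load(hdfs_path)
# equivalently: df = spark.read.csv(hdfs_path)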
When I ran df.count(), I saw a single executor task, which is probably expected(?) since I am working on a compressed file that is not splittable and will be operated on as a single partition.
I checked the number of partitions with df.rdd.getNumPartitions() and it returned 1, probably as expected.
The processing time was around 15-17 seconds across multiple runs of the same command.
I think we can conclude that there was not much parallelism in the above processing.
I now tried df.repartition(10).count(), expecting the data to be repartitioned into 10 new partitions, probably spread across worker nodes. I could see that the number of TASKS now matched the number of partitions I specified, and I was hoping for improved execution time. It turned out to be 25-26 seconds.
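(To compare the runs, I timed the actions from the driver roughly like this; a minimal sketch using time.perf_counter, nothing Spark-specific:)

import time

def timed_count(frame):
    # Run count() and report wall-clock time as seen from the driver.
    start = time.perf_counter()
    n = frame.count()
    print("%d records in %.1fs" % (n, time.perf_counter() - start))

timed_count(df)                  # 1 partition: ~15-17 seconds for me
timed_count(df.repartition(10))  # 10 partitions: ~25-26 seconds, shuffle included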
When I used .repartition(20), it ran for more than 4 minutes and I had to kill it.
Performance got worse. Did I do something wrong, or did I miss a step needed to achieve improved performance?
Note: I did see some good existing posts on this but still did not get clarity, hence this new question.
Upvotes: 0
Views: 457
Reputation: 1
The compressed file seems to be loaded into a single partition on a single executor. When we repartition, we get more tasks running in parallel on different worker nodes; however, repartitioning also incurs additional time for shuffling/copying the data to multiple worker nodes.
This seems to be the reason for the longer processing time.
Conclusion:
a) If the task/action is simple, it is not worth repartitioning the data of a compressed file.
b) If there is a lot of processing down the line, the cost of repartitioning is paid only once, but multiple later processing steps benefit, so the additional time is worth it.
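A rough sketch of case (b), assuming the DataFrame fits in cluster memory (otherwise persist() with a disk-backed storage level would be the safer choice):

# Pay the shuffle cost of repartitioning once, then keep the result
# around so that every later action runs on 10 parallel partitions.
df10 = df.repartition(10).cache()
df10.count()   # first action materialises the cache (includes the shuffle)
df10.count()   # later actions reuse the cached, already-repartitioned data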
Upvotes: 0