Reputation: 582
I'm reading a CSV extract using sparkSession, and the framework is creating about 10K partitions/tasks. Because of this huge task count, the Spark job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Below is the sample code that creates the 10K tasks while reading the extract:
sparkSession
.read
.csv(dfwAbsHdfsPath)
Is there any way I can reduce or configure the task/partition count? Just to add, we are using Spark version 2.3.1.
Upvotes: 2
Views: 164
Reputation: 6998
You can change the default parallelism in the config like this:
.config("spark.default.parallelism", NUMBER)
This influences the partitioning.
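A minimal Scala sketch of how this might look on the session builder, assuming Spark 2.3.x. The parallelism value of 200 and the path string are placeholders, and the coalesce at the end is a separate, additional option (not part of this config) for cutting the partition count after the read:

import org.apache.spark.sql.SparkSession

object ReadCsvExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-read-example")
      // Lower the default parallelism used for shuffles and similar operations
      .config("spark.default.parallelism", 200L)
      .getOrCreate()

    // Placeholder for the HDFS path from the question
    val dfwAbsHdfsPath = "hdfs:///path/to/extract"
    val df = spark.read.csv(dfwAbsHdfsPath)

    // If the file scan itself still yields too many partitions, coalescing
    // after the read is another common way to reduce the downstream task count.
    val fewerPartitions = df.coalesce(200)
    println(fewerPartitions.rdd.getNumPartitions)

    spark.stop()
  }
}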
Upvotes: 1