Anoop Deshpande

Reputation: 582

How to set partition count while reading CSV extract using SparkSession?

I'm reading a CSV extract using SparkSession, and the framework is trying to create 10K partitions/tasks. Because of this huge task count, the Spark job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.

Below is the sample code that creates 10K tasks while reading the extract:

sparkSession
    .read
    .csv(dfwAbsHdfsPath)

Is there any way I can reduce or configure the task/partition count? Just to add, we are using Spark version 2.3.1.

Upvotes: 2

Views: 164

Answers (1)

Robert Kossendey

Reputation: 6998

You can change the default parallelism in the config like this:

.config("spark.default.parallelism", NUMBER)

This setting influences how many partitions Spark uses by default.
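For context, here is a minimal sketch of where that config line fits when building the SparkSession; the app name "MyApp" and the value 200 are placeholder choices, and dfwAbsHdfsPath is the path variable from the question. As an alternative, coalesce can explicitly reduce the partition count after the read.

import org.apache.spark.sql.SparkSession

// Sketch only: "MyApp" and the value 200 are placeholders, not values from the question.
val sparkSession = SparkSession.builder()
  .appName("MyApp")
  .config("spark.default.parallelism", "200")
  .getOrCreate()

// dfwAbsHdfsPath is the question's HDFS path variable.
val df = sparkSession
  .read
  .csv(dfwAbsHdfsPath)

// Alternative: collapse the DataFrame to fewer partitions after reading.
val reduced = df.coalesce(200)

Unlike repartition, coalesce avoids a full shuffle when reducing the partition count, which keeps the overhead low.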

Upvotes: 1
