Reputation: 8281
I'm using Spark 1.6.0 and the DataFrame API for reading partitioned Parquet data.
I'm wondering how many partitions will be used.
Here are some figures on my data:
It seems that Spark uses 2182 partitions, because when I perform a count, the job is split into 2182 tasks.
That seems to be confirmed by df.rdd.partitions.length.
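For reference, this is roughly how I read the data and check the partition count (the path below is just a placeholder for my real dataset):

```scala
// Spark 1.6: read partitioned Parquet data with the DataFrame API
val df = sqlContext.read.parquet("/path/to/partitioned_data")

df.count()                         // the count job runs as 2182 tasks
println(df.rdd.partitions.length)  // also reports 2182
```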
Is that correct? In all cases?
If yes, is it too high given the volume of data (i.e., should I use df.repartition to reduce it)?
Upvotes: 0
Views: 1543
Reputation: 548
Yes, you can use the repartition method to reduce the number of tasks so that it is in balance with the available resources. You also need to define the number of executors per node, the number of nodes, and the memory per node when submitting the application, so that the tasks execute in parallel and utilise the maximum resources.
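For example, a rough sketch (the partition count and resource numbers below are only illustrative and depend on your cluster):

```scala
// Reduce the partition count so each task processes a reasonable amount of data.
val reduced = df.repartition(64)           // full shuffle, evenly sized partitions
// val reduced = df.coalesce(64)           // cheaper alternative: merges partitions without a full shuffle
println(reduced.rdd.partitions.length)     // should now report 64

// Resources are defined when submitting the application, e.g.:
// spark-submit --num-executors 10 --executor-cores 4 --executor-memory 8G ...
```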
Upvotes: 1