Reputation: 8281
I'm using Spark 1.6.0 and the DataFrame API for reading partitioned Parquet data.
I'm wondering how many partitions will be used.
Here are some figures on my data:
It seems that Spark uses 2182 partitions, because when I perform a count, the job is split into 2182 tasks.
That seems to be confirmed by df.rdd.partitions.length.
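For reference, this is roughly how I read the data and check the partition count (the path below is just a placeholder for my real dataset):

```scala
// Spark 1.6: read partitioned Parquet data with the DataFrame API
val df = sqlContext.read.parquet("/path/to/partitioned_data")

df.count()                         // the count job runs as 2182 tasks
println(df.rdd.partitions.length)  // also reports 2182
```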
Is that correct? In all cases?
If yes, is it too high given the volume of data (i.e., should I use df.repartition to reduce it)?
Upvotes: 0
Views: 1543
Reputation: 548
Yes, you can use the repartition method to reduce the number of tasks so that it is in balance with the available resources. You also need to define the number of executors per node, the number of nodes, and the memory per node when submitting the application, so that the tasks execute in parallel and utilise the maximum resources.
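For example, a rough sketch (the partition count and resource numbers below are only illustrative and depend on your cluster):

```scala
// Reduce the partition count so each task processes a reasonable amount of data.
val reduced = df.repartition(64)           // full shuffle, evenly sized partitions
// val reduced = df.coalesce(64)           // cheaper alternative: merges partitions without a full shuffle
println(reduced.rdd.partitions.length)     // should now report 64

// Resources are defined when submitting the application, e.g.:
// spark-submit --num-executors 10 --executor-cores 4 --executor-memory 8G ...
```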
Upvotes: 1