Reputation: 27373
According to the Spark 1.6.3 docs, repartition(partitionExprs: Column*)
should preserve the number of partitions in the resulting DataFrame:
Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions
(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)
But the following example seems to show something else (note that the Spark master is local[4]
in my case):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val myDF = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x")
myDF.rdd.getNumPartitions                   // 4
myDF.repartition($"x").rdd.getNumPartitions // 200 !
How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA).
Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but rather asks why the docs say something different from what I observe in my example.
Upvotes: 2
Views: 4089
Reputation: 2281
This is related to the Tungsten project, which is enabled in Spark and applies hardware-level optimizations. Repartitioning by a column expression uses hash partitioning, which triggers a shuffle operation. By default, spark.sql.shuffle.partitions is set to 200, so the repartitioned DataFrame ends up with 200 partitions. You can verify this by calling explain on your DataFrame before and after repartitioning:
myDF.explain                                 // physical plan before repartitioning
val repartitionedDF = myDF.repartition($"x")
repartitionedDF.explain                      // the plan should now show an exchange with hashpartitioning(x, 200)
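For completeness, here is a minimal sketch (assuming the same sqlContext and myDF as in the question) of two ways to keep the partition count under control: lowering spark.sql.shuffle.partitions via setConf, or using the repartition overload that takes an explicit number of partitions along with the column:

// Sketch, assuming the sqlContext and myDF defined in the question above.

// Option 1: lower the shuffle partition count used by the hash-partitioning shuffle
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions    // 4

// Option 2: pass an explicit number of partitions together with the column
myDF.repartition(4, $"x").rdd.getNumPartitions // 4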
Upvotes: 1