Reputation: 27373
According to the Spark 1.6.3 docs, repartition(partitionExprs: Column*)
should preserve the number of partitions in the resulting DataFrame:
Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions
(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)
But the following example seems to show something else (note that the Spark master is local[4]
in my case):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val myDF = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x")
myDF.rdd.getNumPartitions                   // 4
myDF.repartition($"x").rdd.getNumPartitions // 200 !
How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA).
Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but rather asks why the docs say something different from what I observe in my example.
Upvotes: 2
Views: 4089
Reputation: 2281
This is related to the Tungsten project, which is enabled in Spark and applies hardware-level optimizations. Repartitioning by a column expression uses hash partitioning, which triggers a shuffle operation. By default, spark.sql.shuffle.partitions is set to 200, so the repartitioned DataFrame ends up with 200 partitions. You can verify this by calling explain on your DataFrame before and after repartitioning:
myDF.explain                                 // physical plan before repartitioning
val repartitionedDF = myDF.repartition($"x")
repartitionedDF.explain                      // the plan should now show an exchange with hashpartitioning(x, 200)
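For completeness, here is a minimal sketch (assuming the same sqlContext and myDF as in the question) of two ways to keep the partition count under control: lowering spark.sql.shuffle.partitions via setConf, or using the repartition overload that takes an explicit number of partitions along with the column:

// Sketch, assuming the sqlContext and myDF defined in the question above.

// Option 1: lower the shuffle partition count used by the hash-partitioning shuffle
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions    // 4

// Option 2: pass an explicit number of partitions together with the column
myDF.repartition(4, $"x").rdd.getNumPartitions // 4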
Upvotes: 1