mentongwu

Reputation: 473

how to convert dataframe to RDD and don't change partition?

For some reason I have to convert an RDD to a DataFrame, do something with the DataFrame, and then convert it back, because my interface is an RDD. When I use df.rdd, the number of partitions changes to 1, so I have to repartition and sortBy the RDD. Is there a cleaner solution? Thanks! This is my try:

    val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
    val partition = rdd.getNumPartitions
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val df = rdd.toDF()
    df.rdd.zipWithIndex().sortBy(x => x._2, true, partition).map(x => x._1)

Upvotes: 0

Views: 921

Answers (1)

rogue-one

Reputation: 11577

Partitions should remain the same when you convert a DataFrame to an RDD. For example, when an RDD with 4 partitions is converted to a DataFrame and back to an RDD, the number of partitions stays the same, as shown below.

scala> val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:27

scala> val partition=rdd.getNumPartitions
partition: Int = 4

scala> val df=rdd.toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.rdd.getNumPartitions
res1: Int = 4

scala> df.withColumn("col2", lit(10)).rdd.getNumPartitions
res2: Int = 4
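Note that this only holds for narrow transformations like withColumn. Any operation that introduces a shuffle (orderBy, groupBy, join, repartition, etc.) will change the partitioning; shuffling operations in Spark SQL default to the spark.sql.shuffle.partitions setting (200 unless configured otherwise). A quick sketch of the difference, continuing the same session:

scala> df.repartition(2).rdd.getNumPartitions
res3: Int = 2

scala> df.orderBy("value").rdd.getNumPartitions
res4: Int = 200

So as long as your DataFrame logic avoids shuffles, df.rdd should preserve the original partition count and no extra repartition/sortBy step is needed.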

Upvotes: 2
