Reputation: 473
For some reason I have to convert an RDD to a DataFrame, do something with the DataFrame, and then convert it back, because my interface is an RDD. But when I use df.rdd, the number of partitions changes to 1, so I have to repartition and sortBy the RDD. Is there a cleaner solution? Thanks!
This is my attempt:
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val partition = rdd.getNumPartitions
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()
// Restore the original order and partition count after the round trip
df.rdd.zipWithIndex().sortBy(_._2, ascending = true, numPartitions = partition).map(_._1)
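The zipWithIndex/sortBy trick above can be illustrated without Spark on a plain Scala collection: tag each element with its index, scramble the pairs, then sort by the tag to recover the original order. This is a minimal sketch (the object and helper names are made up for illustration; the shuffle stands in for whatever reordering the DataFrame round trip might cause):

```scala
// Sketch of the zipWithIndex + sortBy idea on a plain collection:
// pair each element with its position, scramble, then sort by position
// to recover the original order.
object ZipSortSketch {
  // Sort the (value, index) pairs by index and drop the index again
  def restoreOrder[A](scrambled: Seq[(A, Long)]): Seq[A] =
    scrambled.sortBy(_._2).map(_._1)

  def main(args: Array[String]): Unit = {
    val data = List(1, 3, 2, 4, 5, 6, 7, 8)
    // Tag each element with its original position, like rdd.zipWithIndex()
    val indexed = data.zipWithIndex.map { case (v, i) => (v, i.toLong) }
    val scrambled = scala.util.Random.shuffle(indexed)
    val restored = restoreOrder(scrambled)
    println(restored == data)  // true
  }
}
```

On an RDD the same sort additionally takes a numPartitions argument, which is why the question passes the saved partition count back into sortBy.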
Upvotes: 0
Views: 921
Reputation: 11577
Partitions should remain the same when you convert the DataFrame back to an RDD. For example, when an RDD with 4 partitions is converted to a DataFrame and back to an RDD, the number of partitions stays the same, as shown below.
scala> val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:27
scala> val partition=rdd.getNumPartitions
partition: Int = 4
scala> val df=rdd.toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.rdd.getNumPartitions
res1: Int = 4
scala> import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.lit

scala> df.withColumn("col2", lit(10)).rdd.getNumPartitions
res2: Int = 4
Upvotes: 2