Unable to Convert to DataFrame from RDD after applying partitioning

Question

I am using Spark 2.1.0.

When I try to use a Window function on a DataFrame:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
val winspec = Window.partitionBy("partition_column")
DF.withColumn("column", avg(DF("col_name")).over(winspec))

My plan changes: the lines below are added to the physical plan, which introduces an extra stage and an extra shuffle. The data is huge, so this slows my query down badly and it runs for hours.

+- Window [avg(cast(someColumn#262 as double)) windowspecdefinition(partition_column#460, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS someColumn#263], [partition_column#460]
   +- *Sort [partition_column#460 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(partition_column#460, 200)

I also see a stage named MapInternalPartition, which I assume means the data is partitioned internally, but I don't know what it is. I think this is why my 100 tasks took 30+ minutes: 99 of them completed within 1-2 minutes, and the last task took the remaining 30 minutes, leaving my cluster idle with no parallel processing. That makes me wonder whether the data is partitioned properly when a Window function is used.

I tried to apply hash partitioning by converting the DataFrame to an RDD, because we cannot apply a custom partitioner or HashPartitioner on a DataFrame.

So if I do this:

val myVal = DF.rdd.partitioner(new HashPartitioner(10000))

I get a return type of Any, on which no actions are available to call.
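For reference, here is a minimal sketch of the kind of round trip I am trying to achieve: key the rows by the partition column, apply a HashPartitioner with partitionBy, drop the keys, and rebuild a DataFrame with the original schema. The names RepartitionSketch, hashRepartition, keyColumn and numPartitions are hypothetical and only for illustration. As far as I understand, Spark SQL does not track the partitioning of an RDD passed to createDataFrame, so the Window operator may still add its own Exchange afterwards.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.{DataFrame, SparkSession}

object RepartitionSketch {
  // Hypothetical helper: hash-partition the rows of `df` on `keyColumn`
  // at the RDD level, then convert the result back to a DataFrame.
  def hashRepartition(
      spark: SparkSession,
      df: DataFrame,
      keyColumn: String,
      numPartitions: Int): DataFrame = {
    val keyIndex = df.schema.fieldIndex(keyColumn)

    val partitioned = df.rdd
      .keyBy(row => row.get(keyIndex))                 // RDD[(Any, Row)]
      .partitionBy(new HashPartitioner(numPartitions)) // only available on pair RDDs
      .values                                          // back to RDD[Row]

    // Reuse the original schema so column names and types are preserved.
    spark.createDataFrame(partitioned, df.schema)
  }
}
```

With the names from my snippet above, this would be called as RepartitionSketch.hashRepartition(spark, DF, "partition_column", 10000), assuming spark is the active SparkSession.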

I checked and saw that the column used for partitioning in the Window function contains only NULL values.
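To illustrate why an all-NULL partition column could explain the single slow task, here is a minimal Spark-free sketch that mimics the rule used by the RDD-level HashPartitioner: a null key goes to partition 0, and any other key goes to a non-negative modulo of its hashCode. The names NullKeySkew, partitionFor and partitionSizes are hypothetical. The Exchange in the physical plan uses a different hash function, so this is only an approximation, but the relevant property should be the same: equal keys, including NULL, always land in the same partition.

```scala
object NullKeySkew {
  // Mimics the RDD HashPartitioner rule: null keys map to partition 0,
  // other keys map to a non-negative modulo of their hashCode.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  // Counts how many keys end up in each partition (empty partitions are omitted).
  def partitionSizes(keys: Seq[Any], numPartitions: Int): Map[Int, Int] =
    keys
      .groupBy(key => partitionFor(key, numPartitions))
      .map { case (partition, members) => partition -> members.size }
}
```

If every key is NULL, partitionSizes reports a single non-empty partition no matter how many partitions are requested, which would match 99 tasks finishing quickly while one task processes all the data.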

