Reputation: 473
I want to calculate the sum of each pair of adjacent numbers in an RDD. My quick and dirty approach is to first buffer the elements into an array and then sum adjacent entries, but that is ugly and inefficient:
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(1 to 9)
val sumNum: RDD[Int] = rdd.mapPartitions { parIter =>
  // Buffer each partition, then sum neighbouring elements
  // (pairs that straddle a partition boundary are missed)
  val result = new ArrayBuffer[Int]()
  val sum = new ArrayBuffer[Int]()
  while (parIter.hasNext) {
    result.append(parIter.next())
  }
  for (i <- 0 until result.length - 1) {
    sum.append(result(i) + result(i + 1))
  }
  sum.toIterator
}
sumNum.collect().foreach(println)
Anyway, is there a better solution? Thanks!
Upvotes: 0
Views: 617
Reputation: 24188
For convenience's sake, you should probably use the Window functions available in the DataFrame API. Here's a reproducible example:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window

// Define window: current row and the next row, ordered by value
val w = Window.partitionBy().orderBy("value").rowsBetween(0, 1)

// Sum "value" over the defined window
// (outside spark-shell, .toDF() also needs import spark.implicits._)
rdd.toDF()
  .withColumn("cumSum", sum(col("value")).over(w))
  .show()
+-----+------+
|value|cumSum|
+-----+------+
| 1| 3|
| 2| 5|
| 3| 7|
| 4| 9|
| 5| 11|
| 6| 13|
| 7| 15|
| 8| 17|
| 9| 9|
+-----+------+
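For completeness, the snippet above assumes a spark-shell session where `rdd` and the `toDF` implicits already exist. A self-contained sketch of the same approach for a standalone application might look like this (the session setup and names here are illustrative, not part of the original snippet):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().master("local[*]").appName("adjacent-sums").getOrCreate()
import spark.implicits._  // required for .toDF() outside spark-shell

val rdd = spark.sparkContext.parallelize(1 to 9)

// Window covering the current row plus the following row, over the whole (un-partitioned) dataset
val w = Window.partitionBy().orderBy("value").rowsBetween(0, 1)

rdd.toDF("value")
  .withColumn("cumSum", sum(col("value")).over(w))
  .show()

Keep in mind that `partitionBy()` with no columns moves all rows into a single partition to compute the window, which is fine for toy data like this but will not scale to large RDDs.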
Upvotes: 1