Reputation: 473
I want to calculate the sum of each pair of adjacent numbers in an RDD. My quick and dirty approach is to first buffer the elements into an array and then sum adjacent entries, but that is ugly and inefficient:
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(1 to 9)
val sumNum: RDD[Int] = rdd.mapPartitions { parIter =>
  // Buffer each partition, then sum neighbouring elements
  // (pairs that straddle a partition boundary are missed)
  val result = new ArrayBuffer[Int]()
  val sum = new ArrayBuffer[Int]()
  while (parIter.hasNext) {
    result.append(parIter.next())
  }
  for (i <- 0 until result.length - 1) {
    sum.append(result(i) + result(i + 1))
  }
  sum.toIterator
}
sumNum.collect().foreach(println)
Anyway, is there a better solution? Thanks!
Upvotes: 0
Views: 617
Reputation: 24188
For convenience's sake, you should probably use the Window functions available in the DataFrame API. Here's a reproducible example:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window

// Define window: current row and the next row, ordered by value
val w = Window.partitionBy().orderBy("value").rowsBetween(0, 1)

// Sum "value" over the defined window
// (outside spark-shell, .toDF() also needs import spark.implicits._)
rdd.toDF()
  .withColumn("cumSum", sum(col("value")).over(w))
  .show()
+-----+------+
|value|cumSum|
+-----+------+
| 1| 3|
| 2| 5|
| 3| 7|
| 4| 9|
| 5| 11|
| 6| 13|
| 7| 15|
| 8| 17|
| 9| 9|
+-----+------+
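For completeness, the snippet above assumes a spark-shell session where `rdd` and the `toDF` implicits already exist. A self-contained sketch of the same approach for a standalone application might look like this (the session setup and names here are illustrative, not part of the original snippet):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().master("local[*]").appName("adjacent-sums").getOrCreate()
import spark.implicits._  // required for .toDF() outside spark-shell

val rdd = spark.sparkContext.parallelize(1 to 9)

// Window covering the current row plus the following row, over the whole (un-partitioned) dataset
val w = Window.partitionBy().orderBy("value").rowsBetween(0, 1)

rdd.toDF("value")
  .withColumn("cumSum", sum(col("value")).over(w))
  .show()

Keep in mind that `partitionBy()` with no columns moves all rows into a single partition to compute the window, which is fine for toy data like this but will not scale to large RDDs.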
Upvotes: 1