Reputation: 5879
In Spark SQL version 1.6, using DataFrames, is there a way to calculate, for a specific column, the sum of the current row and the next one, for every row?
For example, if I have a table with one column, like so
Age
12
23
31
67
I'd like the following output
Sum
35
54
98
The last row is dropped because it has no "next row" to be added to.
Right now I am doing it by ranking the table and joining it with itself, matching rank to rank+1.
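For reference, this is a minimal sketch of that rank-and-join approach; the df name, the row_number ranking, and the aliases are my illustration, not necessarily what I have:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank every row by Age (no partition, so everything goes to one partition)
val ranked = df.withColumn("rank", row_number().over(Window.orderBy("Age")))

// join each row to the one ranked immediately after it; the last row has no
// match, so the inner join drops it
val result = ranked.as("a")
  .join(ranked.as("b"), col("a.rank") === col("b.rank") - 1)
  .select((col("a.Age") + col("b.Age")).as("Sum"))
result.show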
Is there a better way to do this?
Can this be done with a Window function?
Upvotes: 2
Views: 1333
Reputation: 41987
Yes, you can definitely do this with a Window function by using rowsBetween. I have used the person column for grouping purposes in the following example.
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val dataframe = Seq(
  ("A", 12),
  ("A", 23),
  ("A", 31),
  ("A", 67)
).toDF("person", "Age")

// frame covering the current row (0) and the next row (1), ordered by Age
val windowSpec = Window.partitionBy("person").orderBy("Age").rowsBetween(0, 1)

// sum of the two-row frame for each row
val newDF = dataframe.withColumn("sum", sum(dataframe("Age")).over(windowSpec))

// the last row of each partition has no next row, so its sum equals its own
// Age; filtering on that equality drops it
newDF.filter(!(newDF("Age") === newDF("sum"))).show
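One caveat with the equality filter: it would also drop a legitimate row whose next Age happens to be 0. A variant I'd sketch with lead (my adaptation, not the answer's code) sidesteps that by filtering on the null that lead produces at the end of each partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("person").orderBy("Age")
dataframe
  .withColumn("next", lead("Age", 1).over(w))  // Age of the following row, null on the last row
  .filter(col("next").isNotNull)               // drop the row with no successor
  .select((col("Age") + col("next")).as("Sum"))
  .show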
Upvotes: 1