Reputation: 8711
I have a dataframe like below and want to reduce it by combining adjacent rows, i.e. where previous.close = current.open:
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-28"),
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-09-20")
)
The required output is below:
(("Ray","2018-09-01","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-20"))
So far, I'm able to condense pairs of adjacent rows using the DataFrame solution below:
df.alias("t1").join(df.alias("t2"),$"t1.name" === $"t2.name" and $"t1.close"=== $"t2.open" )
.select("t1.name","t1.open","t2.close")
.distinct.show(false)
+-----+----------+----------+
|name |open      |close     |
+-----+----------+----------+
|Scott|2018-09-24|2018-09-30|
|Scott|2018-10-11|2018-10-20|
|Ray  |2018-09-01|2018-09-15|
|Ray  |2018-09-21|2018-09-30|
+-----+----------+----------+
I'm trying to use a similar approach with $"t1.close" =!= $"t2.open" to get the single (unmerged) rows, and then union the two results to get the final output. But I get unwanted rows that I'm not able to filter out correctly. How can I achieve this?
This post is not the same as Spark SQL window function with complex condition, where an additional date column is calculated as a new column.
Upvotes: 1
Views: 489
Reputation: 380
UPDATED: the code is now tested :-)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, when}
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-23","2018-09-28"), // <-- Revised
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-10-20")
).toDF("name", "open", "close")
// <- only compare the dates within a given name, and for each row also look at the previous one
val window = Window.partitionBy("name").orderBy($"open").rowsBetween(-1, Window.currentRow)
df.select(
    $"name", $"open", $"close",
    min($"close").over(window) as "closeBefore_tmp" // <- get the smaller close value (that of the previous row)
  )
  .withColumn("closeBefore", when($"closeBefore_tmp" === $"close", null).otherwise($"closeBefore_tmp")) // <- no previous row in this case: it's the first row for this name, so set closeBefore to null
  .createOrReplaceTempView("tmp")
Now you can compare open and closeBefore.
Upvotes: 1
Reputation: 22449
Here's one approach:
1. Create a new column temp1 with null value if the current open equals the previous close; otherwise the value of the current open
2. Create a column temp2 that backfills the nulls in temp1 with the last non-null value
3. Group by (name, temp2) to generate the contiguous date ranges

I've revised your sample data to cover cases of a contiguous date range spanning 2+ rows.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-23","2018-09-28"), // <-- Revised
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-10-20")
).toDF("name", "open", "close")
val win = Window.partitionBy($"name").orderBy("open", "close")
val df2 = df.
  withColumn("temp1", when(
    row_number.over(win) === 1 || lag($"close", 1).over(win) =!= $"open", $"open")
  ).
  withColumn("temp2", last($"temp1", ignoreNulls=true).over(
    win.rowsBetween(Window.unboundedPreceding, 0)
  ))
df2.show
// +-----+----------+----------+----------+----------+
// | name|      open|     close|     temp1|     temp2|
// +-----+----------+----------+----------+----------+
// |Scott|2018-09-21|2018-09-23|2018-09-21|2018-09-21|
// |Scott|2018-09-23|2018-09-28|      null|2018-09-21|
// |Scott|2018-09-28|2018-09-30|      null|2018-09-21|
// |Scott|2018-10-05|2018-10-09|2018-10-05|2018-10-05|
// |Scott|2018-10-11|2018-10-15|2018-10-11|2018-10-11|
// |Scott|2018-10-15|2018-10-20|      null|2018-10-11|
// |  Ray|2018-09-01|2018-09-10|2018-09-01|2018-09-01|
// |  Ray|2018-09-10|2018-09-15|      null|2018-09-01|
// |  Ray|2018-09-16|2018-09-18|2018-09-16|2018-09-16|
// |  Ray|2018-09-21|2018-09-27|2018-09-21|2018-09-21|
// |  Ray|2018-09-27|2018-09-30|      null|2018-09-21|
// +-----+----------+----------+----------+----------+
The above shows the result of steps 1 and 2, with temp2 holding the value of the earliest open of the corresponding contiguous date range. Step 3 uses max to get the latest close of the date range:
df2.
  groupBy($"name", $"temp2".as("open")).agg(max($"close").as("close")).
  show
// +-----+----------+----------+
// |name |open      |close     |
// +-----+----------+----------+
// |Scott|2018-09-21|2018-09-30|
// |Scott|2018-10-05|2018-10-09|
// |Scott|2018-10-11|2018-10-20|
// |Ray  |2018-09-01|2018-09-15|
// |Ray  |2018-09-16|2018-09-18|
// |Ray  |2018-09-21|2018-09-30|
// +-----+----------+----------+
Upvotes: 2