gabbi

Reputation: 110

ROW_NUMBER() over groups by condition

Suppose we have a DataFrame with a column "events":

events
A
b
c
d
e
A
b
c
d
A
f

I want to get window groups, or just a new column with row_number() over such a split:

events
A
b
c
d
e
----- split here -----
A
b
c
d
----- split here -----
A
f
So I want to put all rows between two "A"s in the "events" column into one group. How can I do that? I feel like it can be done with window functions.
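
For concreteness, here is a minimal sketch that builds such a DataFrame (assuming a SparkSession named spark with its implicits in scope; the "time" column is only a hypothetical ordering column):

import spark.implicits._

// hypothetical sample data; "time" only gives the rows a stable order
val eventWithTime = Seq(
  (1, "A"), (2, "b"), (3, "c"), (4, "d"), (5, "e"),
  (6, "A"), (7, "b"), (8, "c"), (9, "d"),
  (10, "A"), (11, "f")
).toDF("time", "events")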

Upvotes: 1

Views: 1050

Answers (1)

gabbi

Reputation: 110

Finally, I came up with a solution myself. Here it is:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{last, row_number, when}
import spark.implicits._ // for the $"..." column syntax

val windowIndex = Window.partitionBy().orderBy("time") // one global window (all rows in a single partition)

val result = eventWithTime
  // mark each "A" row with a unique, increasing id; all other rows get null
  .withColumn("groupId",
    when($"events" === "A", row_number() over windowIndex).otherwise(null))
  // fill each null with the last non-null id at or before the current row
  .withColumn("groupId", last("groupId", ignoreNulls = true) over windowIndex)
  // drop rows before the first "A"; their groupId is still null
  .filter($"groupId".isNotNull)

(I use the column "time" just to order the events as in the example.)

The idea here is to find all "events" equal to "A" and mark them with a unique id. I did that using row_number over a Window.partitionBy() window. (Maybe it would be better to use monotonically_increasing_id, but I have a lot of data, and monotonically_increasing_id relies on some assumptions to work correctly.) After that, I used the function last over the same window. The important thing here is to set ignoreNulls to true: that way every null is filled with the last non-null value at or before the current row. Then I just drop the rows before the first "A", because their groupId is still null.
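
For comparison, a sketch of what that marking step might look like with monotonically_increasing_id (the caveat above still applies; the generated ids are only guaranteed to be unique, not consecutive):

import org.apache.spark.sql.functions.{monotonically_increasing_id, when}

// every "A" row gets a unique (but not consecutive) id; all other rows get null
val marked = eventWithTime
  .withColumn("groupId",
    when($"events" === "A", monotonically_increasing_id()).otherwise(null))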

Example:

  1. Before any operations:
events
A
b
c
d
e
A
b
c
d
A
f
  2. Assign a unique id to every "A" (null otherwise):
events | groupId
A      | 1
b      | null
c      | null
d      | null
e      | null
A      | 2
b      | null
c      | null
d      | null
A      | 3
f      | null
  3. Fill the nulls with the last non-null value:
events | groupId
A      | 1
b      | 1
c      | 1
d      | 1
e      | 1
A      | 2
b      | 2
c      | 2
d      | 2
A      | 3
f      | 3

Now we can just groupBy or partitionBy on groupId and do whatever we want.
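
For example, a sketch that collects the events of each group into a list (collect_list is a standard Spark SQL aggregate; note that element order inside the collected list is not guaranteed):

import org.apache.spark.sql.functions.collect_list

val groups = result
  .groupBy("groupId")
  .agg(collect_list("events").as("events"))
// one row per "A"-delimited block, e.g. (1, [A, b, c, d, e])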

Upvotes: 2
