Suppose we have a DataFrame with a column "events":
events
A
b
c
d
e
A
b
c
d
A
f
I want to get window groups, or just a new column with row_number() over such a split:
events
A
b
c
d
e
----- split here ---
A
b
c
d
----- split here ---
A
f
So I want to put all rows between two "A" values in the "events" column into one group. How can I do it? I feel like it can be done with window functions.
Upvotes: 1
Views: 1050
Finally, I found a solution myself. Here it is:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{last, row_number, when}

// Single global window ordered by "time" (no partition key).
val windowIndex = Window.partitionBy().orderBy("time")

val result = eventWithTime
  // Give every "A" row a unique id; all other rows stay null.
  .withColumn("groupId",
    when($"events" === "A", row_number().over(windowIndex)).otherwise(null))
  // Fill each null with the last non-null id seen so far.
  .withColumn("groupId", last("groupId", ignoreNulls = true).over(windowIndex))
  .filter($"groupId".isNotNull) // rows before the first "A" are still null
(I use the "time" column just to order the events as in the example.)
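For completeness, an eventWithTime DataFrame matching the example could look something like this (the literal "time" values here are just illustrative):

// Assumes a SparkSession named `spark` (as in spark-shell).
import spark.implicits._

// Hypothetical input: "time" is only used for ordering, "events" holds the markers.
val eventWithTime = Seq(
  (1L, "A"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"),
  (6L, "A"), (7L, "b"), (8L, "c"), (9L, "d"),
  (10L, "A"), (11L, "f")
).toDF("time", "events")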
The idea here is to find all "events" rows equal to "A" and mark each of them with a unique id. I did that with row_number and a window defined by Window.partitionBy(). (Maybe it would be better to use monotonically_increasing_id, but I have a lot of data and there are some assumptions that must hold for monotonically_increasing_id to work correctly.) After that, I used the function last over the same window. The important thing here is to set ignoreNulls to true: that way every null is filled with the last non-null value seen before it. Then I just drop the rows before the first "A", because their groupId is still null.
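Note that this fill step relies on the default frame of an ordered window, which runs up to the current row. As a sketch, the frame can also be written out explicitly (assuming Spark 2.1+ for the Window.unboundedPreceding / Window.currentRow constants); with distinct "time" values this is equivalent to the default:

// Same global window, with the running frame spelled out explicitly.
// Using it in the second withColumn above gives the same result.
val explicitWindow = Window.partitionBy().orderBy("time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)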
Example:
events
A
b
c
d
e
A
b
c
d
A
f
After the first withColumn (only the "A" rows get an id):
events | groupId
A | 1
b | null
c | null
d | null
e | null
A | 2
b | null
c | null
d | null
A | 3
f | null
After filling with last(..., ignoreNulls = true) over the same window:
events | groupId
A | 1
b | 1
c | 1
d | 1
e | 1
A | 2
b | 2
c | 2
d | 2
A | 3
f | 3
Now we can simply groupBy or partitionBy on groupId and do whatever we want.
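For example, a quick sketch of a follow-up aggregation (collect_list and the alias eventsInGroup are just for illustration):

import org.apache.spark.sql.functions.collect_list

// Gather the events of each group into one array, one row per group.
result
  .groupBy("groupId")
  .agg(collect_list("events").as("eventsInGroup"))
  .orderBy("groupId")
  .show()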
Upvotes: 2