paveltr

Reputation: 474

How to split data into groups in pyspark

I need to find groups in time series data.

Data sample

[image: sample data with value and day columns]

I need to output a group column based on value and day.

I've tried using lag, lead and row_number, but it came to nothing.

Upvotes: 3

Views: 1144

Answers (2)

murtihash

Reputation: 8410

Here is a PySpark way to do this: flag the start of each new group using lag, take an incremental sum of that flag to get the groups, and add 1 to get your desired group numbers.
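For reproducibility, here is a minimal sketch of the input DataFrame assumed by the snippet below, reconstructed from the expected output (the names spark and df are assumptions, not from the original post):

# Hypothetical reconstruction of the question's sample (value, day),
# taken from the output shown below.
data = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
        (2, 6), (2, 7),
        (1, 8), (1, 9), (1, 10), (1, 11), (1, 12), (1, 13)]
df = spark.createDataFrame(data, ["value", "day"])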

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Window over the whole frame, ordered by day
w1 = Window().orderBy("day")

# Flag rows where value differs from the previous row's value (the start of a new group),
# then take a running sum of the flags and add 1 to get the group number.
df.withColumn("lag", F.when(F.lag("value").over(w1) != F.col("value"), F.lit(1)).otherwise(F.lit(0)))\
  .withColumn("group", F.sum("lag").over(w1) + 1).drop("lag").show()

#+-----+---+-----+
#|value|day|group|
#+-----+---+-----+
#|    1|  1|    1|
#|    1|  2|    1|
#|    1|  3|    1|
#|    1|  4|    1|
#|    1|  5|    1|
#|    2|  6|    2|
#|    2|  7|    2|
#|    1|  8|    3|
#|    1|  9|    3|
#|    1| 10|    3|
#|    1| 11|    3|
#|    1| 12|    3|
#|    1| 13|    3|
#+-----+---+-----+

Upvotes: 3

GMB

Reputation: 222662

It seems like you want to increment the group every time the value changes. If so, this is a kind of gaps-and-islands problem.

Here is one approach that uses lag() and a cumulative sum():

select
    value,
    day,
    sum(case when value = lag_value then 0 else 1 end) over(order by day) grp
from (
    select t.*, lag(value) over(order by day) lag_value
    from mytable t
) t
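If you want to run this from PySpark rather than plain SQL, one option is to register the DataFrame as a temporary view and pass the query to spark.sql. A minimal sketch, assuming the DataFrame is named df and an active SparkSession named spark:

# Expose df under the table name used in the query above
df.createOrReplaceTempView("mytable")

spark.sql("""
    select
        value,
        day,
        sum(case when value = lag_value then 0 else 1 end) over(order by day) grp
    from (
        select t.*, lag(value) over(order by day) lag_value
        from mytable t
    ) t
""").show()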

Upvotes: 4
