user17146820

Apache Spark: detect key change on a row and group rows

I have the DataFrame below in Spark, where I need to detect a key change (on the column rec) and create a new column called groupId. A new group starts every time a D record is encountered; for example, the first and second rows belong to group 1, and the group id increments when the next D appears.

rec    amount    date
D      250       20220522
C      110       20220522
D      120       20220522
C      100       20220522
C      50        20220522
D      50        20220522
D      50        20220522
D      50        20220522

EXPECTED OUTPUT

rec    amount    date        groupId
D      250       20220522    1
C      110       20220522    1
D      120       20220522    2
C      100       20220522    2
C      50        20220522    2
D      50        20220522    3
D      50        20220522    4
D      50        20220522    5

I tried many ways but couldn't achieve the desired output. I am not sure what I am doing incorrectly; below is what I have tried:

WindowSpec window = Window.orderBy("date");
Dataset<Row> dataset4 = data
        .withColumn("nextRow", functions.lead("rec", 1).over(window))
        .withColumn("prevRow", functions.lag("rec", 1).over(window))
        .withColumn("groupId",
                functions.when(functions.col("nextRow")
                                .equalTo(functions.col("prevRow")),
                        functions.dense_rank().over(window)));

Can someone please tell me what I am doing incorrectly here?

Upvotes: 0

Views: 387

Answers (1)

vilalabinot

Reputation: 1601

Window functions do not quite work like that; here is a workaround, though it may not be the best one.

First, keep track of what the starting value is:

// offset so that group numbering still starts at 1 when the first row is a "C"
val different = if (df.head()(0) == "C") 1 else 0

Then, we assign 0 to C rows and 1 to D rows:

.withColumn("other", when(col("rec").equalTo("C"), 0).otherwise(1))

Next, we create a surrogate id, because no combination of columns uniquely identifies a row:

.withColumn("id", expr("row_number() over (order by date)"))

Finally, we take a running (cumulative) sum of that flag, which increments every time a D is encountered:

.withColumn("group_id",
  sum("other").over(Window.orderBy("id").partitionBy("date")) + different
)

I partitioned by date here; you can remove that, but performance may degrade badly because a window with no partitioning pulls all rows into a single partition. After dropping id, the final result is:

+---+------+--------+-----+--------+
|rec|amount|date    |other|group_id|
+---+------+--------+-----+--------+
|D  |250   |20220522|1    |1       |
|C  |110   |20220522|0    |1       |
|D  |120   |20220522|1    |2       |
|C  |100   |20220522|0    |2       |
|C  |50    |20220522|0    |2       |
|D  |50    |20220522|1    |3       |
|D  |50    |20220522|1    |4       |
|D  |50    |20220522|1    |5       |
+---+------+--------+-----+--------+
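
For completeness, here is a minimal, self-contained Scala sketch that puts the steps above together. The object name, the local SparkSession setup, and the hard-coded sample data are assumptions for illustration, not part of the original code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, expr, sum, when}

object GroupIdOnKeyChange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("group-id-on-key-change")
      .master("local[*]")          // assumption: run locally for the example
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question
    val df = Seq(
      ("D", 250, "20220522"), ("C", 110, "20220522"),
      ("D", 120, "20220522"), ("C", 100, "20220522"),
      ("C", 50,  "20220522"), ("D", 50,  "20220522"),
      ("D", 50,  "20220522"), ("D", 50,  "20220522")
    ).toDF("rec", "amount", "date")

    // Offset so group numbering still starts at 1 when the first row is a "C"
    val different = if (df.head()(0) == "C") 1 else 0

    val result = df
      // 1 marks the start of a new group ("D"), 0 marks a continuation ("C")
      .withColumn("other", when(col("rec").equalTo("C"), 0).otherwise(1))
      // surrogate id so the running sum has a deterministic order to follow
      .withColumn("id", expr("row_number() over (order by date)"))
      // running count of "D" rows seen so far is the group id
      .withColumn("group_id",
        sum("other").over(Window.partitionBy("date").orderBy("id")) + different)
      .drop("id")

    result.show(false)
    spark.stop()
  }
}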

Good luck!

Upvotes: 1
