How to extract the rows that changed id-value in a dataframe?

Question

I'm trying to extract from a dataframe only the rows that the id is changing: Let's say we have the following dataframe:

# id   Date  Value
# 152  12/4  True
# 152  12/4  True
# 152  12/4  True
# 158  12/4  True
# 158  13/4  False
# 158  13/4  False

I want to create a new Dataframe only with the Values when the id is changing and the preview row:

# id   Date   Value
# 152  12/4   True
# 158  12/4   True

I try with lag and window function but I didnt have a good result. Thanks in advance.

Ranga Vure · Accepted Answer

using lag and lead, here is a solution. As per your requirement, when id is changing, this selects current row and also previous row. I modified the test data to cover other scenarios

from pyspark.sql.window import Window
import pyspark.sql.functions as F
df = spark.createDataFrame([[151, '12/4', True],
                            [152, '12/4', True],
                            [152, '12/4', True],
                            [158, '12/4', True],
                            [158, '12/4', True],
                            [158, '12/4', True]
                            ], schema=['id', 'Date', 'Value'])
window = Window.orderBy("id")
df = df.withColumn("prev_id", F.lag(F.col("id")).over(window))
df = df.withColumn("next_id", F.lead(F.col("id")).over(window))

df.filter(
    'id != next_id or id != prev_id'
).drop(
    'prev_id','next_id'
).show()

which results

+---+----+-----+
| id|Date|Value|
+---+----+-----+
|151|12/4| true|
|152|12/4| true| (Id changed, so select and previous row)
|152|12/4| true|
|158|12/4| true| (Id changed, so select and previous row)
+---+----+-----+

How to extract the rows that changed id-value in a dataframe?

Answers (1)

Related Questions