Reputation: 47
I'm trying to extract from a DataFrame only the rows where the id changes. Let's say we have the following DataFrame:
# id Date Value
# 152 12/4 True
# 152 12/4 True
# 152 12/4 True
# 158 12/4 True
# 158 13/4 False
# 158 13/4 False
I want to create a new DataFrame with only the rows where the id changes, together with the previous row:
# id Date Value
# 152 12/4 True
# 158 12/4 True
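For reference, here is the input data above as a reproducible DataFrame (assuming PySpark, since I'm working with lag and window functions):
# A minimal sketch of the sample data above, assuming a pyspark shell where `spark` exists
df = spark.createDataFrame(
    [(152, '12/4', True),
     (152, '12/4', True),
     (152, '12/4', True),
     (158, '12/4', True),
     (158, '13/4', False),
     (158, '13/4', False)],
    schema=['id', 'Date', 'Value'])
df.show()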
I tried with lag and a window function but didn't get a good result. Thanks in advance.
Upvotes: 0
Views: 52
Reputation: 1932
Using lag and lead, here is a solution. As per your requirement, when the id changes, this selects the current row and also the previous row. I modified the test data to cover other scenarios.
from pyspark.sql.window import Window
import pyspark.sql.functions as F

df = spark.createDataFrame([[151, '12/4', True],
                            [152, '12/4', True],
                            [152, '12/4', True],
                            [158, '12/4', True],
                            [158, '12/4', True],
                            [158, '12/4', True]],
                           schema=['id', 'Date', 'Value'])

# order the whole frame by id so lag/lead look at the neighbouring rows
window = Window.orderBy("id")

# id of the previous and of the next row
df = df.withColumn("prev_id", F.lag(F.col("id")).over(window))
df = df.withColumn("next_id", F.lead(F.col("id")).over(window))

# keep a row if its id differs from either neighbour, then drop the helper columns
df.filter(
    'id != next_id or id != prev_id'
).drop(
    'prev_id', 'next_id'
).show()
which results in:
+---+----+-----+
| id|Date|Value|
+---+----+-----+
|151|12/4| true|
|152|12/4| true|  (id changed, so this row and the previous one are selected)
|152|12/4| true|
|158|12/4| true|  (id changed, so this row and the previous one are selected)
+---+----+-----+
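Two notes: a window with orderBy but no partitionBy moves all rows to a single partition (Spark warns about this), so for larger data you would want a partition column, and ordering by id alone is not deterministic among duplicate ids, so include an event timestamp in the ordering if you have one. If you strictly want only the row where the id changes plus the row just before it (and not the first and last rows of the frame), here is a sketch of that variant, using hypothetical helper columns `changed` and `before_change`:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w = Window.orderBy("id")  # ideally also order by an event timestamp, if available

result = (
    df
    # True where the id differs from the previous row's id
    .withColumn("changed", F.col("id") != F.lag("id").over(w))
    # True where the next row is a change row, i.e. this row sits just before a change
    .withColumn("before_change", F.lead("changed").over(w))
    # keep change rows and the rows immediately preceding them;
    # nulls at the frame edges evaluate to null and are filtered out
    .filter(F.col("changed") | F.col("before_change"))
    .drop("changed", "before_change")
)
result.show()
On the question's sample data this should return exactly the two rows shown in the expected output (the last 152 row and the first 158 row).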
Upvotes: 2