Reputation: 39
Edited with a new example for clarity
The following data
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 375| 12:25:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 650| 12:29:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
should yield the below, where we compare each row's 'ID' and 'location' to those of the previous row. I need to log each time an ID was logged at a different location. An ID can visit the same location again later in the sequence, therefore dropDuplicates on ID and location isn't possible.
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
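For anyone who wants to reproduce this, a minimal sketch of how the sample data could be built as a DataFrame (column names taken from the tables above; loggedTime is assumed to be a plain string here, which still sorts correctly in this fixed HH:mm:ss format):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows copied from the table above
df = spark.createDataFrame(
    [
        (67, 312, "12:09:00"),
        (67, 375, "12:23:00"),
        (67, 375, "12:25:00"),
        (67, 650, "12:26:00"),
        (75, 650, "12:27:00"),
        (75, 650, "12:29:00"),
        (75, 800, "12:30:00"),
    ],
    ["ID", "location", "loggedTime"],
)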
Upvotes: 1
Views: 341
Reputation: 64
How about using group by?
from pyspark.sql.functions import col, min

df = df.groupBy(col("id"), col("location")).agg(min(col("loggedTime")))
Upvotes: 0
Reputation: 14905
A Window partitioned by ID and ordered by loggedTime can be used with lag to get the location from the previous row. Then the rows where the current and the previous location are the same can be filtered out:
from pyspark.sql import functions as F
from pyspark.sql import Window
# For each ID, order rows by time so lag() returns the previous row's location
w = Window.partitionBy("ID").orderBy("loggedTime")

df.withColumn("prev_location", F.lag("location").over(w)) \
    .filter("prev_location is null or location <> prev_location") \
    .drop("prev_location") \
    .show()
Output:
+---+--------+-------------------+
| ID|location| loggedTime|
+---+--------+-------------------+
| 67| 312|1970-01-01 00:09:00|
| 67| 375|1970-01-01 00:23:00|
| 67| 650|1970-01-01 00:26:00|
| 75| 650|1970-01-01 00:27:00|
| 75| 800|1970-01-01 00:30:00|
+---+--------+-------------------+
Upvotes: 1