Tony LaRussa
Tony LaRussa

Reputation: 39

pyspark dataframe retrieve the first value in each sequence within an ordered column

Edited with a new example for clarify

The following data

+------+---------+-----------+
|    ID| location| loggedTime|
+------+---------+-----------+
|    67|      312|   12:09:00|    
|    67|      375|   12:23:00|     
|    67|      375|   12:25:00|     
|    67|      650|   12:26:00|     
|    75|      650|   12:27:00|     
|    75|      650|   12:29:00|     
|    75|      800|   12:30:00|     
+------+---------+-----------+

should yield the below, where we compare each row to the previous column 'ID' and 'location'. I need to log each time ID was logged at a different location. They can visit the same location again later in the sequence, therefore dropDupicates on ID and location isn't possible


+------+---------+-----------+
|    ID| location| loggedTime|
+------+---------+-----------+
|    67|      312|   12:09:00|    
|    67|      375|   12:23:00|     
|    67|      650|   12:26:00|     
|    75|      650|   12:27:00|     
|    75|      800|   12:30:00|     
+------+---------+-----------+

Upvotes: 1

Views: 341

Answers (2)

Powerful Lee
Powerful Lee

Reputation: 64

how about useing group by?

df = df.groupBy(col("id"), col("location")).agg(min(col("loggedTime")))

Upvotes: 0

werner
werner

Reputation: 14905

Using a Window ordered by loggedTime can be used to get the location from the previous row. Then the rows where the current and the previous location are the same can be filtered out:

from pyspark.sql import functions as F
from pyspark.sql import Window

w=Window.partitionBy("ID").orderBy("loggedTime")
df.withColumn("prev_location", F.lag("location").over(w)) \
  .filter("prev_location is null or location <> prev_location") \
  .drop("prev_location") \
  .show()

Output:

+---+--------+-------------------+
| ID|location|         loggedTime|
+---+--------+-------------------+
| 67|     312|1970-01-01 00:09:00|
| 67|     375|1970-01-01 00:23:00|
| 67|     650|1970-01-01 00:26:00|
| 75|     650|1970-01-01 00:27:00|
| 75|     800|1970-01-01 00:30:00|
+---+--------+-------------------+

Upvotes: 1

Related Questions