Reputation: 39
Edited with a new example for clarity
The following data
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 375| 12:25:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 650| 12:29:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
should yield the below, where we compare each row's 'ID' and 'location' to those of the previous row. I need to log each time an ID was logged at a different location. An ID can visit the same location again later in the sequence, therefore dropDuplicates on ID and location isn't possible.
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
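For anyone who wants to reproduce this, a minimal sketch of how the sample data could be built as a DataFrame (column names taken from the tables above; loggedTime is assumed to be a plain string here, which still sorts correctly in this fixed HH:mm:ss format):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows copied from the table above
df = spark.createDataFrame(
    [
        (67, 312, "12:09:00"),
        (67, 375, "12:23:00"),
        (67, 375, "12:25:00"),
        (67, 650, "12:26:00"),
        (75, 650, "12:27:00"),
        (75, 650, "12:29:00"),
        (75, 800, "12:30:00"),
    ],
    ["ID", "location", "loggedTime"],
)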
Upvotes: 1
Views: 341
Reputation: 64
How about using group by?
from pyspark.sql.functions import col, min

df = df.groupBy(col("id"), col("location")).agg(min(col("loggedTime")))
Upvotes: 0
Reputation: 14905
A Window partitioned by ID and ordered by loggedTime can be used with lag to get the location from the previous row. Then the rows where the current and the previous location are the same can be filtered out:
from pyspark.sql import functions as F
from pyspark.sql import Window
# For each ID, order rows by time so lag() returns the previous row's location
w = Window.partitionBy("ID").orderBy("loggedTime")

df.withColumn("prev_location", F.lag("location").over(w)) \
    .filter("prev_location is null or location <> prev_location") \
    .drop("prev_location") \
    .show()
Output:
+---+--------+-------------------+
| ID|location| loggedTime|
+---+--------+-------------------+
| 67| 312|1970-01-01 00:09:00|
| 67| 375|1970-01-01 00:23:00|
| 67| 650|1970-01-01 00:26:00|
| 75| 650|1970-01-01 00:27:00|
| 75| 800|1970-01-01 00:30:00|
+---+--------+-------------------+
Upvotes: 1