Reputation: 1022
I have a dataframe df with columns:
date: timestamp
status: String
name: String
I'm trying to find the last status for each name:
val users = df.select("name").distinct
val final_status = users.map { t =>
  val _name = t.getString(0)
  val userRecord = df.where(col("name") === _name)
  val lastRecord = userRecord.sort(desc("date")).first
  lastRecord
}
This works with an Array, but with a Spark DataFrame it throws a java.lang.NullPointerException, since df cannot be referenced from inside a transformation running on the executors.
Update 1: using removeDuplicates
df.sort(desc("date")).removeDuplicates("name")
Is this a good solution?
Upvotes: 1
Views: 849
Reputation: 11573
This
df.sort(desc("date")).removeDuplicates("name")
is not guaranteed to work: dropDuplicates does not specify which of the duplicate rows is kept, so the preceding sort may be ignored. The answers to this question should work for you:
spark: How to do a dropDuplicates on a dataframe while keeping the highest timestamped row
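For reference, the window-function approach from the linked question can be sketched as follows. This is a minimal sketch, assuming a DataFrame df with the name, status, and date columns from the original post; the window name w and output name lastStatus are illustrative, not from the source.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Partition by name, newest date first within each partition.
val w = Window.partitionBy("name").orderBy(col("date").desc)

// Number each user's rows by recency and keep only the newest one.
val lastStatus = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
```

Unlike sort followed by removeDuplicates, row_number over an ordered window deterministically selects the latest row per name.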
Upvotes: 1