Reputation: 163
I have data like the sample below in files that get delivered to me on a regular basis:
2018-11-06 00 46 3574
2018-11-06 01 21 1667
2018-11-06 02 22 1650
I very often get duplicates in subsequent files and I handle those with something like:
df2 = df.dropDuplicates()
My problem is that sometimes I'll get a row that duplicates the first 2 columns but has different values in the 3rd and 4th columns. So in two different files I might get this:
FILE 1
2018-11-06 01 21 1667
FILE 2
2018-11-06 01 23 1682
and then my dataframe looks like this:
2018-11-06 00 46 3574
2018-11-06 01 21 1667
2018-11-06 01 23 1682
2018-11-06 02 22 1650
When this happens I want unique values for the first 2 columns, and I want to do that by keeping the row that has the highest value in the 4th column. So I want my data to look like this when I'm done:
2018-11-06 00 46 3574
2018-11-06 01 23 1682
2018-11-06 02 22 1650
And I can't figure out how to do it. It feels like I need to leverage a window, but I can't figure out how.
Upvotes: 1
Views: 872
Reputation: 1953
The most straightforward way to achieve this is to use rank(). I haven't tested it, but below is some mostly functional code that should get you 95% of the way.
from pyspark.sql import Window
from pyspark.sql.functions import desc, rank

# Rank rows within each (col1, col2) group by col4 descending,
# with col3 descending as a tiebreaker, then keep only the top-ranked row.
rank_window = Window.partitionBy("col1", "col2").orderBy(desc("col4"), desc("col3"))
df2 = df.withColumn("rank", rank().over(rank_window)).filter("rank = 1").drop("rank")
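For reference, here is a self-contained sketch of the same approach run against the sample data from the question. The column names col1 through col4 are placeholders I've assumed for illustration; substitute whatever names your actual schema uses.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import desc, rank

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; col1..col4 are assumed column names.
df = spark.createDataFrame(
    [
        ("2018-11-06", "00", 46, 3574),
        ("2018-11-06", "01", 21, 1667),
        ("2018-11-06", "01", 23, 1682),
        ("2018-11-06", "02", 22, 1650),
    ],
    ["col1", "col2", "col3", "col4"],
)

# Keep only the row with the highest col4 per (col1, col2) pair.
rank_window = Window.partitionBy("col1", "col2").orderBy(desc("col4"), desc("col3"))
df2 = df.withColumn("rank", rank().over(rank_window)).filter("rank = 1").drop("rank")
df2.show()
# The (2018-11-06, 01) group keeps the 23/1682 row and drops the 21/1667 one.

Note that rank() assigns the same rank to exact ties on the ordering columns, so if two rows could tie on both col4 and col3 and you need exactly one row per group, row_number() over the same window guarantees that.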
Upvotes: 2