Reputation: 163
I have data like the sample below in files that get delivered to me on a regular basis:
2018-11-06 00 46 3574
2018-11-06 01 21 1667
2018-11-06 02 22 1650
I very often get duplicates in subsequent files and I handle those with something like:
df2 = df.dropDuplicates()
My problem is that sometimes I'll get a row that duplicates the first 2 columns but has different values in the 3rd and 4th columns. So in two different files I might get this:
FILE 1
2018-11-06 01 21 1667
FILE 2
2018-11-06 01 23 1682
and then my dataframe looks like this:
2018-11-06 00 46 3574
2018-11-06 01 21 1667
2018-11-06 01 23 1682
2018-11-06 02 22 1650
When this happens I want unique values for the first 2 columns, and I want to do that by keeping the row that has the highest value in the 4th column. So I want my data to look like this when I'm done:
2018-11-06 00 46 3574
2018-11-06 01 23 1682
2018-11-06 02 22 1650
And I can't figure out how to do it. It feels like I need to leverage a window, but I can't figure out how.
Upvotes: 1
Views: 872
Reputation: 1953
The most straightforward way to achieve this is to use rank(). I haven't tested it, but below is some mostly functional code that should get you 95% of the way.
from pyspark.sql import Window
from pyspark.sql.functions import desc, rank

# Rank rows within each (col1, col2) group by col4 descending,
# with col3 descending as a tiebreaker, then keep only the top-ranked row.
rank_window = Window.partitionBy("col1", "col2").orderBy(desc("col4"), desc("col3"))
df2 = df.withColumn("rank", rank().over(rank_window)).filter("rank = 1").drop("rank")
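For reference, here is a self-contained sketch of the same approach run against the sample data from the question. The column names col1 through col4 are placeholders I've assumed for illustration; substitute whatever names your actual schema uses.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import desc, rank

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; col1..col4 are assumed column names.
df = spark.createDataFrame(
    [
        ("2018-11-06", "00", 46, 3574),
        ("2018-11-06", "01", 21, 1667),
        ("2018-11-06", "01", 23, 1682),
        ("2018-11-06", "02", 22, 1650),
    ],
    ["col1", "col2", "col3", "col4"],
)

# Keep only the row with the highest col4 per (col1, col2) pair.
rank_window = Window.partitionBy("col1", "col2").orderBy(desc("col4"), desc("col3"))
df2 = df.withColumn("rank", rank().over(rank_window)).filter("rank = 1").drop("rank")
df2.show()
# The (2018-11-06, 01) group keeps the 23/1682 row and drops the 21/1667 one.

Note that rank() assigns the same rank to exact ties on the ordering columns, so if two rows could tie on both col4 and col3 and you need exactly one row per group, row_number() over the same window guarantees that.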
Upvotes: 2