PySpark drop Duplicates and Keep Rows with highest value in a column

Question

I have the following Spark dataset:

id    col1    col2    col3    col4
1      1        5       2      3
1      1        0       2      3
2      3        1       7      7
3      6        1       3      3
3      6        5       3      3

I would like to drop the duplicates in the columns subset ['id,'col1','col3','col4'] and keep the duplicate rows with the highest value in col2. This is what the result should look like:

id    col1    col2    col3    col4
1      1        5       2      3
2      3        1       7      7
3      6        5       3      3

How can I do that in PySpark?

wwnde · Accepted Answer

Another way, compute the max, filter where max=col2. This allows you to keep multiple instances where the condition is true

df.withColumn('max',max('col2').over(Window.partitionBy('id'))).where(col('col2')==col('max')).show()

PySpark drop Duplicates and Keep Rows with highest value in a column

Answers (2)

Related Questions