OSUKevin

Reputation: 39

Keep last when using dropduplicates?

I want to keep the last record, not the first. However, the keep="last" option does not seem to work. For example, given the following:

from pyspark.sql import Row
df = sc.parallelize([ \
    Row(name='Alice', age=5, height=80), \
    Row(name='Alice', age=5, height=80), \
    Row(name='Alice', age=10, height=80)]).toDF()
df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+

And I run:

df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+

I would like the following:

+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|    80|Alice|
+---+------+-----+

It seems keep="last" is not an option in PySpark. Is there another way to achieve this?

Upvotes: 1

Views: 4394

Answers (2)

Alex Grounds

Reputation: 264

A groupBy aggregation is the more idiomatic approach to this problem:

from pyspark.sql import functions as F

df.groupBy(['name', 'height'])\
    .agg(F.max('age').alias('age'))
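
On the sample DataFrame from the question, calling .show() on that aggregation should print something like:

df.groupBy(['name', 'height']).agg(F.max('age').alias('age')).show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80| 10|
+-----+------+---+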

Upvotes: 0

mck

Reputation: 42392

The common way to do this sort of task is to calculate a rank with a suitable partitioning and ordering, and keep the rows with rank = 1:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'rank',
    # rank rows within each (name, height) group, highest age first
    F.rank().over(Window.partitionBy('name', 'height').orderBy(F.desc('age')))
).filter('rank = 1').drop('rank')

df2.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 10|    80|
+-----+---+------+
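
If several rows can tie on the maximum age, rank keeps all of them. A minimal variation (not part of the original answer) that keeps exactly one row per group is row_number:

from pyspark.sql import functions as F, Window

# row_number assigns a unique sequence within each (name, height) group,
# so exactly one row survives even if several rows share the maximum age
w = Window.partitionBy('name', 'height').orderBy(F.desc('age'))
df2 = df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')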

Another way is to use last, but it gives nondeterministic results:

import pyspark.sql.functions as F

df2 = df.groupBy('name', 'height').agg(
    *[F.last(c).alias(c) for c in df.columns if c not in ['name', 'height']]
)

df2.show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80| 10|
+-----+------+---+

Upvotes: 2
