Reputation: 39
I want to keep the last record, not the first. However, the keep="last"
option does not seem to work. For example, given the following:
from pyspark.sql import Row
df = sc.parallelize([
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=10, height=80)]).toDF()
df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
And I run:
df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+
I would like the following:
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10| 80|Alice|
+---+------+-----+
The keep='last' option does not appear to exist in PySpark, so how can I keep the last record instead?
Upvotes: 1
Views: 4394
Reputation: 264
groupBy is the more idiomatic approach to this problem:
from pyspark.sql import functions as F
# Keep one row per (name, height), taking the maximum age within each group.
df2 = df.groupBy(['name', 'height']) \
    .agg(F.max('age').alias('age'))
df2.show()
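A related pattern, if you also want to carry along columns other than the one being aggregated, is to take the max of a struct. With the sample data there is only age, but the same shape works for wider DataFrames; a sketch, not part of the original answer:
from pyspark.sql import functions as F
# max over a struct compares field by field, so the whole struct belonging to
# the greatest age is kept; extra fields could be added after 'age' to carry
# them along with the winning row.
df_latest = (df.groupBy('name', 'height')
               .agg(F.max(F.struct('age')).alias('latest'))
               .select('name', 'height', 'latest.age'))
df_latest.show()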
Upvotes: 0
Reputation: 42392
The common way to do this sort of task is to calculate a rank with a suitable partitioning and ordering, and keep the rows with rank = 1:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'rank',
F.rank().over(Window.partitionBy('name', 'height').orderBy(F.desc('age')))
).filter('rank = 1').drop('rank')
df2.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 10| 80|
+-----+---+------+
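Note that rank keeps all tied rows if several records share the same maximum age. If exactly one row per group is required, a row_number variant of the same idea (a sketch, not part of the original answer) avoids that:
from pyspark.sql import functions as F, Window
# row_number assigns a unique 1..n within each partition, so exactly one row
# per (name, height) group survives even when ages are tied.
w = Window.partitionBy('name', 'height').orderBy(F.desc('age'))
df3 = (df.withColumn('rn', F.row_number().over(w))
         .filter('rn = 1')
         .drop('rn'))
df3.show()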
Another way is to use last, but the result is non-deterministic because no ordering is specified within each group:
import pyspark.sql.functions as F
# Take the "last" observed value of every non-key column within each group;
# without an explicit ordering, which row counts as "last" is not guaranteed.
df2 = df.groupBy('name', 'height').agg(
    *[F.last(c).alias(c) for c in df.columns if c not in ['name', 'height']]
)
df2.show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice| 80| 10|
+-----+------+---+
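If "last" semantics are actually wanted, the result can be made deterministic by ordering the window explicitly; a sketch, assuming age is the column that defines which record counts as "last":
from pyspark.sql import functions as F, Window
# An explicit ordering plus a frame spanning the whole partition makes last()
# well defined; without the full frame, last() only sees rows up to the current one.
w = (Window.partitionBy('name', 'height')
           .orderBy('age')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df3 = (df.select('name', 'height', F.last('age').over(w).alias('age'))
         .dropDuplicates(['name', 'height']))
df3.show()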
Upvotes: 2