Karry

Reputation: 11

pyspark keep only one type of group

Let's say I want to check the distinct name count within each group, and I only want to keep one name per group.

df.groupBy('job', 'age', 'gender').agg(countDistinct('name')).filter('count(DISTINCT name) > 1').show()

job       age  gender  count(DISTINCT name)
engineer  22   M       3

Then I want to go into this group and keep only one of the names. Let's say when we look at this group, we have something like this:

job       age  gender  name  score
engineer  22   M       John  10
engineer  22   M       Leo   15
engineer  22   M       Leo   16
engineer  22   M       Mike  17
engineer  22   M       Mike  19

And then I want to keep only Mike (dropping John and Leo in this group), so the group becomes:

job       age  gender  name  score
engineer  22   M       Mike  17
engineer  22   M       Mike  19

How can I write a function in pyspark to implement this, so that I can apply it to different DataFrames? Thanks

Upvotes: 0

Views: 58

Answers (1)

Emma

Reputation: 9308

You can use the rank window function, ordering by name in descending order within each group. All rows sharing the top-ranked name get rank 1, so both Mike rows survive the filter.

from pyspark.sql.functions import col, desc, rank
from pyspark.sql.window import Window

w = Window.partitionBy('job', 'age', 'gender').orderBy(desc('name'))
df = (df.withColumn('rnk', rank().over(w))
      .filter(col('rnk') == 1))

Upvotes: 0
