Reputation: 11
Let's say I want to check the distinct name count within each group, and I only want to keep one name per group.
from pyspark.sql.functions import countDistinct

df.groupBy('job', 'age', 'gender').agg(countDistinct('name')).filter('count(DISTINCT name)>1').show()
job | age | gender | count(DISTINCT name) |
---|---|---|---|
engineer | 22 | M | 3 |
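The same check can also be written with an explicit alias, so the filter string does not have to reference the auto-generated column name; a minimal sketch (the alias name_count is just illustrative):

from pyspark.sql.functions import countDistinct

# flag groups that contain more than one distinct name
(df.groupBy('job', 'age', 'gender')
   .agg(countDistinct('name').alias('name_count'))
   .filter('name_count > 1')
   .show())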
Then I want to go into this group and keep only one of the names. Let's say when we look at this group, we have something like this:
job | age | gender | name | score |
---|---|---|---|---|
engineer | 22 | M | John | 10 |
engineer | 22 | M | Leo | 15 |
engineer | 22 | M | Leo | 16 |
engineer | 22 | M | Mike | 17 |
engineer | 22 | M | Mike | 19 |
And then I want to keep only Mike (drop John and Leo in this group), so I want the group to look like this:
job | age | gender | name | score |
---|---|---|---|---|
engineer | 22 | M | Mike | 17 |
engineer | 22 | M | Mike | 19 |
How can I write a function in PySpark to implement this, so that I can apply it to different DataFrames? Thanks
Upvotes: 0
Views: 58
Reputation: 9308
You can use the rank window function.
from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank

w = Window.partitionBy('job', 'age', 'gender').orderBy(desc('name'))
df = (df.withColumn('rnk', rank().over(w))
        .filter(col('rnk') == 1))
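If you want to reuse this on different DataFrames, you could wrap the same window logic in a small helper; a minimal sketch, where the function name keep_top_name and its parameters are just illustrative:

from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, desc, rank

def keep_top_name(df: DataFrame, group_cols, name_col='name'):
    # rank names within each group, alphabetically last name first
    w = Window.partitionBy(*group_cols).orderBy(desc(name_col))
    # rank (not row_number) keeps every row tied on the top-ranked name,
    # so both Mike rows survive in the example group
    return (df.withColumn('rnk', rank().over(w))
              .filter(col('rnk') == 1)
              .drop('rnk'))

result = keep_top_name(df, ['job', 'age', 'gender'])

Ordering by desc(name) keeps the alphabetically last name per group (Mike in your example); if you want a different rule, such as the name with the highest score, change the orderBy accordingly.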
Upvotes: 0