Reputation: 11
Let's say I want to check the distinct name count within each group, and I only want to keep one name per group.
from pyspark.sql.functions import countDistinct

df.groupBy('job', 'age', 'gender').agg(countDistinct('name')).filter('count(DISTINCT name)>1').show()
job | age | gender | count(DISTINCT name) |
---|---|---|---|
engineer | 22 | M | 3 |
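The same check can also be written with an explicit alias, so the filter string does not have to reference the auto-generated column name; a minimal sketch (the alias name_count is just illustrative):

from pyspark.sql.functions import countDistinct

# flag groups that contain more than one distinct name
(df.groupBy('job', 'age', 'gender')
   .agg(countDistinct('name').alias('name_count'))
   .filter('name_count > 1')
   .show())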
Then I want to go into this group and keep only one of the names. Let's say when we look at this group, we have something like this:
job | age | gender | name | score |
---|---|---|---|---|
engineer | 22 | M | John | 10 |
engineer | 22 | M | Leo | 15 |
engineer | 22 | M | Leo | 16 |
engineer | 22 | M | Mike | 17 |
engineer | 22 | M | Mike | 19 |
And then I want to keep only Mike (drop John and Leo in this group), so I want the group to look like this:
job | age | gender | name | score |
---|---|---|---|---|
engineer | 22 | M | Mike | 17 |
engineer | 22 | M | Mike | 19 |
How can I write a function in PySpark to implement this, so that I can apply it to different DataFrames? Thanks
Upvotes: 0
Views: 58
Reputation: 9308
You can use the rank window function.
from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank

w = Window.partitionBy('job', 'age', 'gender').orderBy(desc('name'))
df = (df.withColumn('rnk', rank().over(w))
        .filter(col('rnk') == 1))
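If you want to reuse this on different DataFrames, you could wrap the same window logic in a small helper; a minimal sketch, where the function name keep_top_name and its parameters are just illustrative:

from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, desc, rank

def keep_top_name(df: DataFrame, group_cols, name_col='name'):
    # rank names within each group, alphabetically last name first
    w = Window.partitionBy(*group_cols).orderBy(desc(name_col))
    # rank (not row_number) keeps every row tied on the top-ranked name,
    # so both Mike rows survive in the example group
    return (df.withColumn('rnk', rank().over(w))
              .filter(col('rnk') == 1)
              .drop('rnk'))

result = keep_top_name(df, ['job', 'age', 'gender'])

Ordering by desc(name) keeps the alphabetically last name per group (Mike in your example); if you want a different rule, such as the name with the highest score, change the orderBy accordingly.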
Upvotes: 0