I am fairly new to PySpark and I am running into some hiccups. Let's say I have a DataFrame with two columns:
| firstName | lastName |
|-----------|----------|
| Bill      | Apple    |
| Mike      | Apple    |
| Jeff      | Apple    |
| Paul      | Apple    |
| George    | Bowers   |
| Kevin     | Bowers   |
| Leon      | Bowers   |
| Fred      | Bowers   |
My question is: how can I randomly select 2 rows for each distinct value of lastName? Like this:
| firstName | lastName |
|-----------|----------|
| Jeff      | Apple    |
| Bill      | Apple    |
| Fred      | Bowers   |
| Kevin     | Bowers   |
My first thought was to collect the distinct last names and iterate over them with a for loop, but that is clearly not the recommended pattern in PySpark. I assume the right approach is one that keeps the computation distributed?
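For reference, here is a minimal way to build this sample DataFrame (a sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("Bill", "Apple"), ("Mike", "Apple"), ("Jeff", "Apple"), ("Paul", "Apple"),
        ("George", "Bowers"), ("Kevin", "Bowers"), ("Leon", "Bowers"), ("Fred", "Bowers"),
    ],
    ["firstName", "lastName"],  # column names taken from the table above
)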
You can do this with a bit of analytic (window) function magic ✨
from pyspark.sql import functions as F, Window

# Assign a row number within each lastName partition, ordered randomly,
# then keep the first two rows of each partition and drop the helper column.
df.withColumn(
    "r", F.row_number().over(Window.partitionBy("lastName").orderBy(F.rand()))
).where(F.col("r") <= 2).drop("r").show()
+---------+--------+
|firstName|lastName|
+---------+--------+
|     Paul|   Apple|
|     Bill|   Apple|
|    Kevin|  Bowers|
|     Leon|  Bowers|
+---------+--------+
If I re-run it, for example:
+---------+--------+
|firstName|lastName|
+---------+--------+
|     Paul|   Apple|
|     Mike|   Apple|
|     Fred|  Bowers|
|    Kevin|  Bowers|
+---------+--------+
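The picks differ between runs because F.rand() is seeded randomly each time; passing an explicit seed, e.g. F.rand(42), makes the selection reproducible. If you only need roughly, rather than exactly, two rows per group, stratified sampling via DataFrame.sampleBy is an alternative. A sketch, assuming the same df and illustrative per-group fractions:

# Keep approximately 50% of each lastName group.
# The fractions dict maps each lastName value to its sampling probability;
# the 0.5 values here are illustrative, not prescribed by the question.
fractions = {"Apple": 0.5, "Bowers": 0.5}
df.sampleBy("lastName", fractions, seed=42).show()

Note that sampleBy does not guarantee an exact row count per group, so the row_number approach above is the one to use when exactly two rows are required.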