user7298979

Reputation: 549

Using distinct values from a column to parallelize a PySpark dataframe and randomly select rows

I am fairly new to PySpark and I am running into some hiccups. Let's say I have a dataframe with two columns:

| firstName | lastName |
|-----------|----------|
| Bill      | Apple    |
| Mike      | Apple    |
| Jeff      | Apple    |
| Paul      | Apple    |
| George    | Bowers   |
| Kevin     | Bowers   |
| Leon      | Bowers   |
| Fred      | Bowers   |

My question is: how can I randomly select 2 rows for each distinct value of lastName? Like this:

| firstName | lastName |
|-----------|----------|
| Jeff      | Apple    |
| Bill      | Apple    |
| Fred      | Bowers   |
| Kevin     | Bowers   |

What I was thinking was to generate a list of the distinct last names and loop over it, but that obviously isn't the recommended way to work within PySpark's framework. I would think that leveraging Spark's parallel computing would be the recommended approach here? For reference, the loop-based idea I had in mind is sketched below.
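This is only a rough sketch of that idea (assuming df is the dataframe above), and it is exactly the pattern I suspect is wrong, since it handles one last name at a time from the driver:

from functools import reduce
from pyspark.sql import functions as F

# Pull the distinct last names back to the driver...
last_names = [r["lastName"] for r in df.select("lastName").distinct().collect()]

# ...then sample 2 rows per last name, one group at a time, and union the results.
per_name = [
    df.filter(F.col("lastName") == name).orderBy(F.rand()).limit(2)
    for name in last_names
]
result = reduce(lambda a, b: a.union(b), per_name)
result.show()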

Upvotes: 0

Views: 196

Answers (1)

Steven

Reputation: 15258

You can do this with some analytic (window) function magic ✨

from pyspark.sql import functions as F, Window

# Assign a random rank to each row within its lastName group,
# then keep the first 2 rows of every group.
df.withColumn(
    "r", F.row_number().over(Window.partitionBy("lastName").orderBy(F.rand()))
).where(F.col("r") <= 2).drop("r").show()
+---------+--------+
|firstName|lastName|
+---------+--------+
|     Paul|   Apple|
|     Bill|   Apple|
|    Kevin|  Bowers|
|     Leon|  Bowers|
+---------+--------+

If I re-run it, for example:

+---------+--------+
|firstName|lastName|
+---------+--------+
|     Paul|   Apple|
|     Mike|   Apple|
|     Fred|  Bowers|
|    Kevin|  Bowers|
+---------+--------+
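The rows change between runs because F.rand() produces a new random ordering each time. If you want a reproducible sample, a small variation of the same snippet with a fixed seed for rand should give a stable pick (assuming the data and its partitioning do not change):

from pyspark.sql import functions as F, Window

# Same logic as above, but rand() is seeded so the sampled rows are stable across runs
w = Window.partitionBy("lastName").orderBy(F.rand(seed=42))
df.withColumn("r", F.row_number().over(w)).where(F.col("r") <= 2).drop("r").show()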

Upvotes: 1
