Ram Kumar V

Reputation: 123

Extract the first "set of rows" matching a particular condition in a Spark DataFrame (PySpark)

I have a Spark DataFrame with data like below:

ID | UseCase
-----------------
0  | Unidentified
1  | Unidentified
2  | Unidentified
3  | Unidentified
4  | UseCase1
5  | UseCase1
6  | Unidentified
7  | Unidentified
8  | UseCase2
9  | UseCase2
10 | UseCase2
11 | Unidentified
12 | Unidentified
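
For reference, the sample above can be reproduced with something like the following (the construction is illustrative only; my real data comes from an upstream source):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above: (ID, UseCase)
data = [
    (0, "Unidentified"), (1, "Unidentified"), (2, "Unidentified"),
    (3, "Unidentified"), (4, "UseCase1"), (5, "UseCase1"),
    (6, "Unidentified"), (7, "Unidentified"), (8, "UseCase2"),
    (9, "UseCase2"), (10, "UseCase2"), (11, "Unidentified"),
    (12, "Unidentified"),
]
df = spark.createDataFrame(data, ["ID", "UseCase"])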

I need to extract the first 4 rows that have the value Unidentified in the UseCase column and do further processing on them. At this point I don't want the middle two or the last two rows that also have the Unidentified value.

I want to avoid relying on the ID column, because its values are not fixed; the data above is just a sample. When I use a map function (after converting the DataFrame to an RDD) or UDFs, I end up with all 8 Unidentified rows in my output DataFrame, which is expected, since those functions operate on every row independently.

How can this be achieved? I am working in PySpark. I don't want to use collect on the DataFrame and get it back as a list to iterate over, as that would defeat the purpose of Spark. The DataFrame can be 4-5 GB in size.

Could you please suggest how this can be done? Thanks in advance!

Upvotes: 1

Views: 5949

Answers (1)

Chondrops

Reputation: 758

Just do a filter and a limit. The following code is Scala, but you'll get the idea.

Assuming your DataFrame is called df:

df.filter($"UseCase"==="Unidentified").limit(4).collect()
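
A rough PySpark equivalent would be the following sketch (assuming the DataFrame is named df; note that without an explicit orderBy, Spark does not guarantee which 4 matching rows limit keeps):

from pyspark.sql.functions import col

# Keep only the Unidentified rows, then take 4 of them.
first_four = df.filter(col("UseCase") == "Unidentified").limit(4)

# collect() brings those 4 rows to the driver; if you want to keep
# processing distributed, skip it and continue working with first_four.
rows = first_four.collect()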

Upvotes: 2
