Ram Kumar V

Reputation: 123

Extract the first "set of rows" matching a particular condition in a Spark DataFrame (PySpark)

I have a Spark DataFrame with data like below:

ID | UseCase
-----------------
0  | Unidentified
1  | Unidentified
2  | Unidentified
3  | Unidentified
4  | UseCase1
5  | UseCase1
6  | Unidentified
7  | Unidentified
8  | UseCase2
9  | UseCase2
10 | UseCase2
11 | Unidentified
12 | Unidentified
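
For reference, the sample above can be reproduced with something like the following (the construction is illustrative only; my real data comes from an upstream source):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above: (ID, UseCase)
data = [
    (0, "Unidentified"), (1, "Unidentified"), (2, "Unidentified"),
    (3, "Unidentified"), (4, "UseCase1"), (5, "UseCase1"),
    (6, "Unidentified"), (7, "Unidentified"), (8, "UseCase2"),
    (9, "UseCase2"), (10, "UseCase2"), (11, "Unidentified"),
    (12, "Unidentified"),
]
df = spark.createDataFrame(data, ["ID", "UseCase"])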

I need to extract the first 4 rows that have the value Unidentified in the UseCase column and do further processing on them. At this point I don't want the middle two or the last two rows that also have the Unidentified value.

I want to avoid relying on the ID column, because its values are not fixed; the data above is just a sample. When I use a map function (after converting the DataFrame to an RDD) or UDFs, I end up with all 8 Unidentified rows in my output DataFrame, which is expected, since those functions operate on every row independently.

How can this be achieved? I am working in PySpark. I don't want to use collect on the DataFrame and get it back as a list to iterate over, as that would defeat the purpose of Spark. The DataFrame can be 4-5 GB in size.

Could you please suggest how this can be done? Thanks in advance!

Upvotes: 1

Views: 5949

Answers (1)

Chondrops

Reputation: 758

Just do a filter and a limit. The following code is Scala, but you'll get the idea.

Assuming your DataFrame is called df:

df.filter($"UseCase"==="Unidentified").limit(4).collect()
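
A rough PySpark equivalent would be the following sketch (assuming the DataFrame is named df; note that without an explicit orderBy, Spark does not guarantee which 4 matching rows limit keeps):

from pyspark.sql.functions import col

# Keep only the Unidentified rows, then take 4 of them.
first_four = df.filter(col("UseCase") == "Unidentified").limit(4)

# collect() brings those 4 rows to the driver; if you want to keep
# processing distributed, skip it and continue working with first_four.
rows = first_four.collect()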

Upvotes: 2
