Reputation: 123
I have a Spark DataFrame with data like below:
ID | UseCase
-----------------
0 | Unidentified
1 | Unidentified
2 | Unidentified
3 | Unidentified
4 | UseCase1
5 | UseCase1
6 | Unidentified
7 | Unidentified
8 | UseCase2
9 | UseCase2
10 | UseCase2
11 | Unidentified
12 | Unidentified
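For reference, the sample above can be built with something like this (a minimal sketch; the SparkSession setup is the usual boilerplate):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample").getOrCreate()

# Reproduce the sample table shown above.
data = [
    (0, "Unidentified"), (1, "Unidentified"), (2, "Unidentified"),
    (3, "Unidentified"), (4, "UseCase1"), (5, "UseCase1"),
    (6, "Unidentified"), (7, "Unidentified"), (8, "UseCase2"),
    (9, "UseCase2"), (10, "UseCase2"), (11, "Unidentified"),
    (12, "Unidentified"),
]
df = spark.createDataFrame(data, ["ID", "UseCase"])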
I have to extract the first 4 rows that have the value Unidentified in the UseCase column and do further processing with them. I don't want to get the middle or last two Unidentified rows at this point.
I want to avoid using the ID column, as its values are not fixed; the data above is just a sample.
When I use the map function (after converting the DataFrame to an RDD) or UDFs, I end up with all 8 Unidentified rows in my output DataFrame (which is expected, since these functions operate on every row and can't limit the result).
How can this be achieved? I am working in PySpark. I don't want to use collect on the DataFrame and get it as a list to iterate over. This would defeat the purpose of Spark. The DataFrame size can go up to 4-5 GB.
Could you please suggest how this can be done? Thanks in advance!
Upvotes: 1
Views: 5949
Reputation: 758
Just do a filter and a limit. The following code is Scala, but you'll get the point.
Assuming your DataFrame is called df:
df.filter($"UseCase"==="Unidentified").limit(4).collect()
Upvotes: 2