Reputation: 870
I would like to select an exact number of rows randomly from my PySpark DataFrame. I know of the `sample()` function, but it won't let me specify the exact number of rows I want. The problem is that when I do `sampled_df = df.sample(0.2)`, if my `df` has 1,000,000 rows, I don't necessarily get 200,000 rows in `sampled_df`.
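A minimal sketch of what I mean (using `spark.range` to build a dummy DataFrame; the exact counts vary from run to run):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)   # 1,000,000 rows
sampled_df = df.sample(0.2)   # fraction of rows, not an exact count

# Prints a value only approximately equal to 200,000,
# and it changes between runs.
print(sampled_df.count())
```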
Upvotes: 1
Views: 5281
Reputation: 5032
You can use a combination of `rand` and `limit`, specifying the required number of rows `n`:
from pyspark.sql import functions as F

sparkDF.orderBy(F.rand()).limit(n)
Note this is a simple implementation: `limit(n)` returns exactly `n` rows, provided the DataFrame contains at least that many. Since `orderBy` over the full dataset is a costly operation (it triggers a global sort), you can filter the dataset down to your required conditions first.
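A minimal end-to-end sketch of this approach (the `spark.range` DataFrame and the filter condition are placeholders for your own data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
n = 200_000

# Optional: narrow the data first, since ordering the full
# DataFrame by rand() is expensive. This condition on the
# `id` column is purely illustrative.
filtered = df.filter(F.col("id") >= 0)

# Shuffle the rows randomly, then take exactly n of them.
# Pass a seed to F.rand() if you need reproducible results.
exact_sample = filtered.orderBy(F.rand(seed=42)).limit(n)

print(exact_sample.count())  # exactly 200,000 here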
Upvotes: 7