puifais

Reputation: 870

How to sample() an exact number of rows, not a fraction, from a PySpark DataFrame

I would like to select an exact number of rows randomly from my PySpark DataFrame. I know of the sample() function, but it won't let me specify the exact number of rows I want. The problem is that when I do sampled_df = df.sample(0.2) and my df has 1,000,000 rows, I don't necessarily get 200,000 rows in sampled_df.
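For example (a minimal reproduction; the spark.range DataFrame is a synthetic stand-in for my real df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)  # synthetic DataFrame with 1,000,000 rows
sampled_df = df.sample(0.2)  # Bernoulli sampling: each row is kept with probability 0.2
print(sampled_df.count())    # close to 200,000, but rarely exactly 200,000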

Upvotes: 1

Views: 5281

Answers (1)

Vaebhav

Reputation: 5032

You can use a combination of rand and limit, specifying the required number of rows n:

from pyspark.sql import functions as F
sparkDF.orderBy(F.rand()).limit(n)  # shuffle the rows randomly, then keep exactly n

Note that this is a simple implementation: limit(n) returns exactly n rows (or fewer, if the DataFrame holds fewer than n). Since orderBy over a random column shuffles the whole dataset and is therefore a costly operation, you can filter the dataset to your required conditions first; see the sketch below.
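A minimal, self-contained sketch of this approach (the spark.range DataFrame, the sample size n, and the seed are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)  # stand-in for your real DataFrame
n = 200_000                  # exact number of rows wanted

# Give every row a random sort key, then keep the first n rows.
# Passing a seed to F.rand() makes the sample reproducible.
sampled_df = df.orderBy(F.rand(seed=42)).limit(n)

print(sampled_df.count())  # exactly 200000, as long as df has at least n rows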

Upvotes: 7
