user2316771

Reputation: 149

How to select n rows from a large data set using Spark

I need to select n rows from a very large data set with millions of rows, say 4 million rows out of 15 million. Currently, I add a row_number to the records within each partition and select the required percentage of records from each partition. For instance, 4 million is about 26.67% of 15 million. But when I select 26% from each partition, the total comes up short because the fractional 0.67% is dropped. Rows are selected when their row_number is less than the percentage threshold. Is there a better way to do this?
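
To make the shortfall concrete, here is a small arithmetic sketch in plain Python using only the numbers from the question. It assumes the percentage is truncated to a whole number and ignores any additional per-partition rounding, so the real gap may differ somewhat.

```python
# Numbers taken from the question.
total_rows = 15_000_000
target_rows = 4_000_000

# Exact fraction needed: 4/15, roughly 0.2667.
exact_fraction = target_rows / total_rows

# Truncating to a whole percentage drops the fractional part.
truncated_pct = int(exact_fraction * 100)

# Rows actually selected when every partition contributes the truncated percentage
# (assumption: rounding inside individual partitions is ignored).
selected = total_rows * truncated_pct // 100

# How many rows are missing compared with the target.
shortfall = target_rows - selected

print(truncated_pct, selected, shortfall)
```

With these figures, truncating to 26% leaves the selection 100,000 rows short of the 4 million target.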

Upvotes: 1

Views: 792

Answers (1)

DataNoob

Reputation: 205

The DataFrame sample function can be used. A solution is available in the linked question: How to select an exact number of random rows from DataFrame
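
As a minimal sketch of that idea, assuming PySpark: `DataFrame.sample` only returns approximately `fraction * count` rows, so one common pattern is to oversample slightly and then trim with `limit`. The helper names `oversample_fraction` and `sample_exact` and the 5% `margin` are hypothetical choices for illustration, not part of the Spark API or necessarily what the linked answer does.

```python
def oversample_fraction(n, total, margin=0.05):
    """Fraction to pass to sample(): the exact ratio padded by `margin`, capped at 1.0.

    `margin` is an illustrative safety factor (an assumption, not a Spark setting).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return min(1.0, (n / total) * (1.0 + margin))


def sample_exact(df, n, seed=42, margin=0.05):
    """Return a DataFrame with at most n rows drawn from a random sample of df.

    `df` is expected to be a pyspark.sql.DataFrame. sample() is approximate,
    so we oversample a little and let limit(n) trim the result down to n.
    """
    total = df.count()  # triggers a full pass over the data
    fraction = oversample_fraction(n, total, margin)
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sampled.limit(n)
```

For the question's numbers this would be called as `sample_exact(df, 4_000_000)`. Two caveats: if the random sample happens to contain fewer than n rows, the result is smaller than n, and `limit` keeps whichever sampled rows come first rather than choosing among them at random. If an exact count with uniform randomness matters more than speed, ordering by `pyspark.sql.functions.rand()` and then calling `limit(n)` is an alternative, at the cost of a full sort.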

Upvotes: 2
