Reputation: 149
I need to select n rows from a very large data set with millions of rows, say 4 million rows out of 15 million. Currently, I add a row_number to the records within each partition and select the required percentage of records from each partition. For instance, 4 million is 26.66 % of 15 million. But when I try to take 26 % from each partition, the total comes up short because of the missing 0.66 %. Rows are selected when their row_number is less than the percentage threshold. Is there a better way to do this?
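To make the shortfall concrete, here is a minimal plain-Python sketch of one possible fix: compute an integer quota per partition so that the quotas add up to exactly n, then keep rows whose row_number is at most that partition's quota. The partition sizes and the helper name allocate_quotas are hypothetical, chosen only for illustration; the rounding scheme is the largest-remainder method, which is a swapped-in technique and not something from the original post.

```python
def allocate_quotas(partition_sizes, n):
    """Split a target of n rows across partitions in proportion to their sizes.

    Uses integer arithmetic only, so the quotas always sum to exactly n.
    """
    total = sum(partition_sizes)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total row count")

    # Integer part of each partition's proportional share, plus the leftover
    # numerator that was dropped by the floor division.
    quotas = [size * n // total for size in partition_sizes]
    remainders = [size * n % total for size in partition_sizes]

    # Rows still missing after flooring; hand them out one at a time to the
    # partitions that lost the most in the rounding.
    shortfall = n - sum(quotas)
    by_remainder = sorted(
        range(len(partition_sizes)), key=lambda i: remainders[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        quotas[i] += 1
    return quotas


# Hypothetical partition sizes that add up to 15 million rows.
sizes = [5_000_000, 4_000_000, 3_500_000, 2_500_000]
target = 4_000_000

# Flat 26 % per partition, as in the question: the total falls short.
flat_total = sum(size * 26 // 100 for size in sizes)

# Per-partition quotas that add up to the target exactly.
quotas = allocate_quotas(sizes, target)

print("flat 26 % total:", flat_total)
print("quotas:", quotas, "sum:", sum(quotas))
```

With these assumed sizes, a flat 26 % selects 3,900,000 rows, which is 100,000 short of the target, while the per-partition quotas sum to 4,000,000. The same quotas could then replace the single percentage threshold in the row_number filter.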
Upvotes: 1
Views: 792
Reputation: 205
The DataFrame sample function can be used. A solution is available in the linked question: How to select an exact number of random rows from DataFrame
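As a rough sketch of how the sample function could be combined with a row cap, assuming a PySpark DataFrame: sample treats the fraction as a per-row probability, so the returned count varies from run to run. Oversampling slightly and then applying limit caps the result at n. The helper name sample_at_most_n and the 10 % oversampling margin are assumptions made for illustration, not taken from the linked question.

```python
def sample_at_most_n(df, n, oversample=1.1, seed=None):
    """Return a random subset of a Spark DataFrame, capped at n rows.

    df is expected to be a pyspark.sql.DataFrame. The result has exactly n
    rows only if the oversampled draw produced at least n rows, which is
    very likely for large inputs but not guaranteed.
    """
    total = df.count()
    if total == 0 or n <= 0:
        return df.limit(0)

    # Ask for slightly more than n / total, because sample() draws each row
    # independently and can come back with fewer rows than fraction * total.
    fraction = min(1.0, oversample * n / total)
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)

    # Trim the surplus so the result never exceeds n rows.
    return sampled.limit(n)
```

For the numbers in the question, the call would look like sample_at_most_n(df, 4_000_000). Note that df.count() triggers a full pass over the data, so if the total row count is already known it may be cheaper to pass the fraction in directly.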
Upvotes: 2