Reputation: 141
Memory needed for a query like "select id from table order by rand()" will exceed the allotted memory, causing the query to fail. How can I get a random permutation of all rows from a pretty large table? The table has over 1 billion rows.
Upvotes: 3
Views: 3724
Reputation: 172993
Our team's business requires sampling from the whole dataset repeatedly. The sample size varies each time from 80 million to 100K ...
Below is for BigQuery Standard SQL and uses a table with 12+ billion rows, so it should work for you too :o)
The pattern for the query is:
#standardSQL
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(<total number of rows in table> * RAND() AS INT64), <ratio of sample>) = 1
Formally, <ratio of sample> = CAST(<total number of rows in table> / <sample size> AS INT64).
And in place of the 1 in = 1 you can use any integer between 0 and <ratio of sample> - 1.
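If you don't want to hardcode the numbers, here is a sketch along the same lines that derives the row count and the ratio inside the query itself (same public table; the target sample size of 100000 is just an illustration - and the extra COUNT(*) is cheap since BigQuery answers it from table metadata):
#standardSQL
-- Sketch: derive <total number of rows in table> and <ratio of sample>
-- on the fly for a desired sample size of ~100K rows
WITH stats AS (
  SELECT COUNT(*) AS total_rows
  FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
)
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`, stats
WHERE MOD(CAST(total_rows * RAND() AS INT64),
          CAST(total_rows / 100000 AS INT64)) = 1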
So for a sample of 100K - you can use
#standardSQL
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(12186710727 * RAND() AS INT64), 121867) = 1
The above returned 99,770 sampled rows in 8 seconds
And for a sample of 80M - you can use
#standardSQL
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(12186710727 * RAND() AS INT64), 152) = 1
This returned 80,175,660 sampled rows in 13 seconds
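And since you sample repeatedly: RAND() is re-evaluated on every run, so each execution already produces a different sample; but, as noted above, you can also vary the remainder. A quick sketch (the remainder 37 is an arbitrary pick for illustration - any value from 0 to 151 works here):
#standardSQL
-- Same expected sample size (~80M), just a different remainder
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(12186710727 * RAND() AS INT64), 152) = 37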
Upvotes: 5