Reputation: 141
Memory needed for a query like "select id from table order by rand()" will exceed the allotted memory, causing the query to fail. How can I get a random permutation of all rows from a pretty large table? The table has over 1 billion rows.
Upvotes: 3
Views: 3724
Reputation: 172993
Our team's business requires sampling from the whole dataset repeatedly. The sample size varies each time from 80 million to 100K ...
Below is for BigQuery Standard SQL and uses a table with 12+ billion rows, so it should work for you too :o)
The pattern for the query is:
#standardSQL
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(<total number of rows in table> * RAND() AS INT64), <ratio of sample>) = 1
Formally, <ratio of sample> = CAST(<total number of rows in table> / <sample size> AS INT64).
And in place of the 1 in = 1 you can use any integer between 0 and <ratio of sample> - 1.
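If you don't want to hardcode the numbers, here is a sketch along the same lines that derives the row count and the ratio inside the query itself (same public table; the target sample size of 100000 is just an illustration - and the extra COUNT(*) is cheap since BigQuery answers it from table metadata):
#standardSQL
-- Sketch: derive <total number of rows in table> and <ratio of sample>
-- on the fly for a desired sample size of ~100K rows
WITH stats AS (
  SELECT COUNT(*) AS total_rows
  FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
)
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`, stats
WHERE MOD(CAST(total_rows * RAND() AS INT64),
          CAST(total_rows / 100000 AS INT64)) = 1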
So for a sample of 100K - you can use
#standardSQL
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(12186710727 * RAND() AS INT64), 121867) = 1
The above returned 99,770 sampled rows in 8 seconds
And for a sample of 80M - you can use
#standardSQL
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(12186710727 * RAND() AS INT64), 152) = 1
This returned 80,175,660 sampled rows in 13 seconds
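And since you sample repeatedly: RAND() is re-evaluated on every run, so each execution already produces a different sample; but, as noted above, you can also vary the remainder. A quick sketch (the remainder 37 is an arbitrary pick for illustration - any value from 0 to 151 works here):
#standardSQL
-- Same expected sample size (~80M), just a different remainder
SELECT start_position
FROM `bigquery-public-data.genomics_rice.Rice3K_DeepVariant_Os_Nipponbare_Reference_IRGSP_1_0`
WHERE MOD(CAST(12186710727 * RAND() AS INT64), 152) = 37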
Upvotes: 5