Ted

Reputation: 1062

Efficient sampling of a fixed number of rows in BigQuery

I have a large dataset of size N, and want to get a (uniformly) random sample of size n. This question offers two possible solutions:

SELECT foo FROM mytable WHERE RAND() < n/N

→ This is fast, but doesn't give me exactly n rows (only approximately).

SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n

→ This requires sorting all N rows, which seems unnecessary and wasteful (especially if n << N).

Is there a solution that combines the advantages of both? I imagine I could use the first solution to select 2n rows, then sort this smaller dataset, but it's sort of ugly and not guaranteed to work, so I'm wondering whether there's a better option.
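Concretely, something like the following sketch (with n and N substituted for actual numbers) is what I have in mind, though I haven't verified it:

SELECT foo, RAND() AS r
FROM (
  SELECT foo
  FROM mytable
  WHERE RAND() < 2*n/N)
ORDER BY r
LIMIT n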

Upvotes: 2

Views: 7908

Answers (1)

enle lin

Reputation: 1714

I compared the execution times of the two queries using BigQuery standard SQL on the natality sample dataset (137,826,763 rows), taking a sample of size n from the source_year column. The queries were executed without using cached results.

Query1:

SELECT source_year
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < n/137826763

Query2:

SELECT source_year, rand() AS r
FROM `bigquery-public-data.samples.natality`
ORDER BY r
LIMIT n

Result:

n        Query1   Query2
1000     ~2.5s    ~2.5s
10000    ~3s      ~3s
100000   ~3s      ~4s
1000000  ~4.5s    ~15s

For n <= 10^5 the difference is ~1s, and for n >= 10^6 the execution times differ significantly. The cause seems to be that when LIMIT is added to the query, the ORDER BY runs on multiple workers. See the original answer provided by Mikhail Berlyant.

I thought your proposal to combine both queries could be a possible solution, so I compared the execution time of the combined query:

New Query:

SELECT source_year, rand() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < 2*n/137826763)
ORDER BY r
LIMIT n

Result:

n       Query1    New Query
1000    ~2.5s     ~3s
10000   ~3s       ~3s
100000  ~3s       ~3s
1000000 ~4.5s     ~6s

The execution time in this case stays within ~1.5s of Query1 for n <= 10^6. It is a good idea to select n + some_rows rows in the subquery instead of 2n rows, where some_rows is a constant large enough to ensure you get more than n rows.

Regarding what you said about “not guaranteed to work”, I understand that you are worried that the new query might not retrieve exactly n rows. If some_rows is large enough, the subquery will virtually always return more than n rows: the number of rows passing the RAND() filter follows a Binomial(N, (n + some_rows)/N) distribution, whose standard deviation is roughly sqrt(n), so choosing some_rows to be a few multiples of sqrt(n) makes falling short of n rows extremely unlikely. The outer query then returns exactly n rows.
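To make that concrete, here is a sketch (mine, not one of the benchmarked queries above) with n = 1000000 and an assumed safety margin of some_rows = 5000, i.e. about 5 * sqrt(n):

-- n = 1000000, some_rows = 5000 (assumed safety margin, roughly 5 * sqrt(n))
SELECT source_year, RAND() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < (1000000 + 5000) / 137826763)
ORDER BY r
LIMIT 1000000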

To summarize, the combined query is not as fast as Query1, but it returns exactly n rows and is faster than Query2, so it could be a solution for uniformly random samples. I want to point out that if ORDER BY is not specified, the BigQuery output is non-deterministic, which means you might receive a different result each time you execute the query. If you execute the following query several times without using cached results, you will get different results.

SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5

Therefore, depending on how random you need the sample to be, this may be a better solution.
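For example, if strict uniformity is not required, a variant that avoids the sort (my own sketch, not benchmarked above) would keep the RAND() filter but drop the ORDER BY, accepting that which of the filtered rows come back is non-deterministic:

-- Same filter as before, but without ORDER BY: returns exactly n rows,
-- though the choice among the filtered rows is non-deterministic rather
-- than strictly uniform.
SELECT source_year
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < (1000000 + 5000) / 137826763)
LIMIT 1000000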

Upvotes: 6
