Reputation: 43
I'm using the RAND
function in bigquery to provide me with a random sample of data, and unioning it with another sample of the same dataset.
This is for a machine learning problem where I'm interested in one class more than the other.
I've recreated the logic using a public dataset.
SELECT
COUNT(1),
bigarticle
FROM
(
SELECT
1 as [bigarticle]
FROM [bigquery-public-data:samples.wikipedia]
WHERE num_characters > 50000
),
(
SELECT
0 as [bigarticle]
FROM [bigquery-public-data:samples.wikipedia]
WHERE (is_redirect is null) AND (RAND() < 0.01)
)
GROUP BY bigarticle
Most of the time this behaves as expected,
giving one row with the count of rows where num_characters
is more than 50k,
and giving another row with a count of a 1% sample of rows where is_redirect
is null.
(This is an approximation of the logic I use in my internal dataset).
If you run this query repeatedly, occasionally it gives unexpected results.
In this result set (bquijob_124ad56f_15da8af982e
) I only get a single row, containing the count of bigarticle
= 1.
Upvotes: 1
Views: 940
Reputation: 33705
RAND
does not use a deterministic seed. If you want deterministic results, you need to hash/fingerprint a column in the table and use a modulus to select a subset of values instead. Using legacy SQL:
#legacySQL
SELECT
COUNT(1),
bigarticle
FROM (
SELECT
1 as [bigarticle]
FROM [bigquery-public-data:samples.wikipedia]
WHERE num_characters > 50000
), (
SELECT
0 as [bigarticle]
FROM [bigquery-public-data:samples.wikipedia]
WHERE (is_redirect is null) AND HASH(title) % 100 = 0
)
GROUP BY bigarticle;
Using standard SQL in BigQuery, which is recommended since legacy SQL is not under active development:
#standardSQL
SELECT
COUNT(*),
bigarticle
FROM (
SELECT
1 as bigarticle
FROM `bigquery-public-data.samples.wikipedia`
WHERE num_characters > 50000
UNION ALL
SELECT
0 as bigarticle
FROM `bigquery-public-data.samples.wikipedia`
WHERE (is_redirect is null) AND MOD(FARM_FINGERPRINT(title), 100) = 0
)
GROUP BY bigarticle;
Upvotes: 3