ewannic
ewannic

Reputation: 43

Random sample in bigquery gives inconsistent results

I'm using the RAND function in bigquery to provide me with a random sample of data, and unioning it with another sample of the same dataset. This is for a machine learning problem where I'm interested in one class more than the other.

I've recreated the logic using a public dataset.

SELECT
COUNT(1),
bigarticle
FROM
(
  SELECT 
  1 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE num_characters > 50000
  ),
  (
  SELECT 
  0 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE (is_redirect is null) AND (RAND() < 0.01)
  )

GROUP BY bigarticle

Most of the time this behaves as expected, giving one row with the count of rows where num_characters is more than 50k, and giving another row with a count of a 1% sample of rows where is_redirect is null. (This is an approximation of the logic I use in my internal dataset).

If you run this query repeatedly, occasionally it gives unexpected results. In this result set (bquijob_124ad56f_15da8af982e) I only get a single row, containing the count of bigarticle = 1.

Upvotes: 1

Views: 940

Answers (1)

Elliott Brossard
Elliott Brossard

Reputation: 33705

RAND does not use a deterministic seed. If you want deterministic results, you need to hash/fingerprint a column in the table and use a modulus to select a subset of values instead. Using legacy SQL:

#legacySQL
SELECT
  COUNT(1),
  bigarticle
FROM (
  SELECT
    1 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE num_characters > 50000
), (
  SELECT 
    0 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE (is_redirect is null) AND HASH(title) % 100 = 0
)
GROUP BY bigarticle;

Using standard SQL in BigQuery, which is recommended since legacy SQL is not under active development:

#standardSQL
SELECT
  COUNT(*),
  bigarticle
FROM (
  SELECT
    1 as bigarticle
  FROM `bigquery-public-data.samples.wikipedia`
  WHERE num_characters > 50000
  UNION ALL
  SELECT
    0 as bigarticle
  FROM `bigquery-public-data.samples.wikipedia`
  WHERE (is_redirect is null) AND MOD(FARM_FINGERPRINT(title), 100) = 0
)
GROUP BY bigarticle;

Upvotes: 3

Related Questions