syltruong
syltruong

Reputation: 2723

In BigQuery, how to random split query results?

In BigQuery (standard SQL), I would like to randomly split result lines from a query.

I currently have, for instance, for a table mytable, with columns (order_id, created_at, country):

for a 10%/90% split, on the one hand

SELECT 
   *
FROM
   `mytable`
WHERE RAND() <= 10/90

on the other hand

SELECT 
   *
FROM
   `mytable`
WHERE RAND() > 10/90

But this yields an inconsistent split. Is there a way to achieve this properly?

Upvotes: 2

Views: 2865

Answers (1)

syltruong
syltruong

Reputation: 2723

The solution is to use a hashing function on one column that discriminates uniquely each line of your source table (for instance here orderId).

BigQuery has a hashing function which target type is a signed INT64 (and source STRING or BYTES): FARM_FINGERPRINT (from here).

A solution is to encode each line according to its FARM_FINGERPRINT(orderId) which is a sample from a uniform distribution of all INT64 numbers.

For a given k, MOD(ABS(FARM_FINGERPRINT(orderId)),k) constitutes a uniform distribution of integers in [0,k-1] (Be wary that MOD(a,b) with a negative and b positive may return a negative number).

Thus, assume you want a 10%/90% split.

The queries for each of the splits would look like:

SELECT 
   *
FROM
   `mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),10) = 0

and

SELECT 
   *
FROM
   `mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),10) != 0

It can be generalised to any 1/k split:

SELECT 
   *
FROM
   `mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),@k) = 0

and

SELECT 
   *
FROM
   `mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),@k) != 0

Changing the value of the discriminative mod integer (0 in the above) allows to have up to k different splits (which can be very useful if you are doing multiple-fold cross validation in ML for instance).

Last but not least:

  • the folds are robust to data addition in the table
  • the folds remain unchanged, whenever the queries are ran

Upvotes: 6

Related Questions