Reputation: 2723
In BigQuery (standard SQL), I would like to randomly split result lines from a query.
The split needs to be consistent and should not change over time: meaning that every time the query is ran, the split should remain the same.
In particular, if data is added to the source table, data previously in one side of the split should remain in the same split.
The strategy should be able to handle different splitting ratios
I currently have, for instance, for a table mytable
, with columns (order_id
, created_at
, country
):
for a 10%/90% split, on the one hand
SELECT
*
FROM
`mytable`
WHERE RAND() <= 10/90
on the other hand
SELECT
*
FROM
`mytable`
WHERE RAND() > 10/90
But this yields an inconsistent split. Is there a way to achieve this properly?
Upvotes: 2
Views: 2865
Reputation: 2723
The solution is to use a hashing function on one column that discriminates uniquely each line of your source table (for instance here orderId
).
BigQuery has a hashing function which target type is a signed INT64
(and source STRING
or BYTES
): FARM_FINGERPRINT
(from here).
A solution is to encode each line according to its FARM_FINGERPRINT(orderId)
which is a sample from a uniform distribution of all INT64
numbers.
For a given k
, MOD(ABS(FARM_FINGERPRINT(orderId)),k)
constitutes a uniform distribution of integers in [0,k-1]
(Be wary that MOD(a,b)
with a
negative and b
positive may return a negative number).
Thus, assume you want a 10%/90% split.
The queries for each of the splits would look like:
SELECT
*
FROM
`mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),10) = 0
and
SELECT
*
FROM
`mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),10) != 0
It can be generalised to any 1/k split:
SELECT
*
FROM
`mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),@k) = 0
and
SELECT
*
FROM
`mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(orderId)),@k) != 0
Changing the value of the discriminative mod integer (0
in the above) allows to have up to k
different splits (which can be very useful if you are doing multiple-fold cross validation in ML for instance).
Last but not least:
Upvotes: 6