Reputation: 593
I have a table in the form
ID | DOC
-------------
id1 | d1
-------------
id1 | d2
-------------
id2 | d3
-------------
id2 | d4
-------------
id3 | d5
-------------
The goal is to group the table by ID and for each group, select a random number from the number of groups (in this case, select a random number from [1, 3]) and assign all rows of each group one number. One possible configuration is
ID | DOC | GROUP_NUM
--------------------------
id1 | d1 | 2
--------------------------
id1 | d2 | 2
--------------------------
id2 | d3 | 1
--------------------------
id2 | d4 | 1
--------------------------
id3 | d5 | 3
--------------------------
I was thinking of using ROW_NUMBER() and PARTITION() functions. What is a better way to go about it considering the table in Bigquery is quite big?
Upvotes: 0
Views: 1645
Reputation: 172993
Consider below approach
with array_with_random_nums as (
select array_agg(num order by rand()) nums
from (select count(distinct id) cnt from `project.dataset.table`),
unnest(generate_array(1,cnt)) num
)
select t.*,
nums[offset(dense_rank() over(order by id) - 1)] group_num
from `project.dataset.table` t, array_with_random_nums
Upvotes: 0
Reputation: 1269873
If the random number can be sequential, you can use dense_rank()
:
select t.*, dense_rank() over (order by id) as group_num
from t;
Or for a bit more randomness:
select t.*,
dense_rank() over (order by farm_fingerprint(cast(id as string)), id) as group_num
from t;
Alternatively, a separate calculation by id
might be simplest:
select *
from t join
(select id,
dense_rank() over (order by rand()) as group_num
from t
group by id
) tt
using (id)
Upvotes: 1