Soumajyoti Sarkar
Soumajyoti Sarkar

Reputation: 593

Generate a random number for each group and assign it to all rows in the group

I have a table in the form

ID     | DOC 
-------------
id1    | d1  
-------------
id1    | d2  
-------------
id2    | d3 
-------------
id2    | d4
-------------
id3    | d5
-------------

The goal is to group the table by ID and for each group, select a random number from the number of groups (in this case, select a random number from [1, 3]) and assign all rows of each group one number. One possible configuration is

ID     | DOC | GROUP_NUM
--------------------------
id1    | d1  | 2
--------------------------
id1    | d2  | 2
--------------------------
id2    | d3  | 1
--------------------------
id2    | d4  | 1
--------------------------
id3    | d5  | 3
--------------------------

I was thinking of using ROW_NUMBER() and PARTITION() functions. What is a better way to go about it considering the table in Bigquery is quite big?

Upvotes: 0

Views: 1645

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Consider below approach

with array_with_random_nums as (
  select array_agg(num order by rand()) nums
  from (select count(distinct id) cnt from `project.dataset.table`), 
  unnest(generate_array(1,cnt)) num  
)
select t.*, 
  nums[offset(dense_rank() over(order by id) - 1)] group_num
from `project.dataset.table` t, array_with_random_nums

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269873

If the random number can be sequential, you can use dense_rank():

select t.*, dense_rank() over (order by id) as group_num
from t;

Or for a bit more randomness:

select t.*,
       dense_rank() over (order by farm_fingerprint(cast(id as string)), id) as group_num
from t;

Alternatively, a separate calculation by id might be simplest:

select *
from t join
     (select id,
             dense_rank() over (order by rand()) as group_num
      from t
      group by id
     ) tt
     using (id)

Upvotes: 1

Related Questions