danikpsh
danikpsh

Reputation: 13

Is there a way to shuffle rows in a table into distinctive fixed size chunks using SQL only?

I have a very big table (~300 million rows) with the following structure: my_table(id, group, chunk, new_id), where chunk and new_id are set to NULL.

I want to set the rows of each group to a random chunk with distinct new_id in the chunk. Each chunk should be of fixed size of 100.

For example if group A has 1278 rows, they should go into 13 chunks (0-12), 12 chunks with 100 rows s.t. new_id are in range (0-99) and another single chunk with 78 rows s.t. new_id are in range (0-77).

The organization into chunks and within the chunks should be a random permutation where each row in A is assigned with a unique (chunk, new_id) tuple.

I'm successfully doing it using pandas but it takes hours, mostly due to memory and bandwidth limitations.

Is it possible to execute using only a SQL query? I'm using postgres 9.6.

Upvotes: 1

Views: 241

Answers (1)

GMB
GMB

Reputation: 222482

You could do this with row_number():

select id, group, rn / 100 chunk, rn % 100 new_id
from (select t.*, row_number() over(order by random()) - 1 rn from mytable t) t

The inner query assigns a random integer number to each record (starting at 0). The outer query does arithmetic to compute the chunk and new id.

If you want an update query:

update mytable t set chunk = x.rn / 3, new_id = x.rn % 3
from (select id, row_number() over(order by random()) - 1 rn from mytable t) x
where x.id = t.id

Demo on DB Fiddle for a dataset of 20 records with chunks of 3 records .

Before:

id | grp | chunk | new_id
-: | --: | ----: | -----:
 1 |   1 |  nullnull
 2 |   2 |  nullnull
 3 |   3 |  nullnull
 4 |   4 |  nullnull
 5 |   5 |  nullnull
 6 |   6 |  nullnull
 7 |   7 |  nullnull
 8 |   8 |  nullnull
 9 |   9 |  nullnull
10 |  10 |  nullnull
11 |  11 |  nullnull
12 |  12 |  nullnull
13 |  13 |  nullnull
14 |  14 |  nullnull
15 |  15 |  nullnull
16 |  16 |  nullnull
17 |  17 |  nullnull
18 |  18 |  nullnull
19 |  19 |  nullnull
20 |  20 |  nullnull

After:

id | grp | chunk | new_id
-: | --: | ----: | -----:
19 |  19 |     0 |      0
11 |  11 |     0 |      1
20 |  20 |     0 |      2
12 |  12 |     1 |      0
14 |  14 |     1 |      1
17 |  17 |     1 |      2
 3 |   3 |     2 |      0
 8 |   8 |     2 |      1
 5 |   5 |     2 |      2
13 |  13 |     3 |      0
10 |  10 |     3 |      1
 2 |   2 |     3 |      2
16 |  16 |     4 |      0
18 |  18 |     4 |      1
 6 |   6 |     4 |      2
 1 |   1 |     5 |      0
15 |  15 |     5 |      1
 7 |   7 |     5 |      2
 4 |   4 |     6 |      0
 9 |   9 |     6 |      1

Upvotes: 1

Related Questions