Postgresql - Apply a group by across multiple columns on a very large table

Question

I'm working with a very large table (~425 million rows, ~800 columns, ~750GB on disk) that looks like (4 columns included here):

 id  | c1 |   c2   |   c3   | c4
-----+----+--------+--------+----
 101 | t  | green  | small  | 10
 102 | t  | green  | small  | 11
 103 | f  | red    | medium | 12
 104 | f  | yellow | medium | 13
 105 | t  | blue   | medium | 14
 106 | t  | green  | large  | 14

What I'm trying to do is group the ids together by each unique value across each column. The output would look like:

colname | value  | ids
----------------------
c1      | t      | 101, 102, 105, 106
c1      | f      | 103, 104
c2      | green  | 101, 102, 106
c2      | red    | 103
c2      | yellow | 104
c2      | blue   | 105
c3      | small  | 101, 102
...

Each column will have between 2-15 distinct values, so there would be very many ids in each grouping given that there are a total of ~425 million ids. What I have decided to do is chunk up the ids into groups of a specific size and then keep track of the chunks, which would look like this (just showing column c1):

colname | value  | chunk | ids
------------------------------
c1      | t      | 1     | {101, 102, ...}
c1      | t      | 2     | {205, 206, ...}
c1      | t      | 3     | {331, 332, ...}
c1      | f      | 1     | {103, 104, ...}
...

The problem I'm facing is that I don't know how to retrieve these results across multiple columns at once. For a single column, this SQL would provide the above table:

select 'c1' as colname, tmp.c1 as value, tmp.chunk, array_agg(tmp.id) as ids
from (
  select id, c1, (row_number() over (partition by c1) - 1) / ARRAY_SIZE + 1 as chunk
  from my_large_table
) as tmp
group by (tmp.c2, tmp.chunk)

What I'm concerned about is having to run this query ~800 times (however many columns there are), each time reading ~425 million rows. I know that grouping sets provide a way of grouping by multiple different sets at the same time, meaning we can read the table once and have all of the results we're looking for, but I don't know how to do this in a way that 1) incorporates the chunking behavior and 2) keeps the relevant colname in the result set, which would be required for each grouping set in all of the grouping sets we are considering.

Additional context: ordering is unimportant for ids, columns, values and chunks in the desired final result set.

Any help would be much appreciated, thanks.

Postgresql - Apply a group by across multiple columns on a very large table

Answers (1)

Related Questions