Reputation: 23
I have an SQL table that looks like this:
ID | SOURCE | INPUT_A | OUTPUT_A | INPUT_B | OUTPUT_B | ... | INPUT_Z | OUTPUT_Z | MOD_DATE |
---|---|---|---|---|---|---|---|---|---|
(String) | (Number) | (Number) | (Number) | (Number) | (Number) | ... | (Number) | (Number) | (Date) |
ABC-123 | 999 | 12.1 | 11.9 | 40.7 | 41.0 | ... | 1.1 | 0.1 | 2023-01-19 |
DEF-456 | 111 | 11.1 | 12.9 | 41.7 | 41.0 | ... | 1.2 | 2.1 | 2023-01-19 |
I would like to get a query result that looks like this:
ID | INPUT_SUM | OUTPUT_SUM |
---|---|---|
ABC-123 | 80.7 | 79.7 |
DEF-456 | 80.8 | 82.8 |
The long way to do it seems to be this:
SELECT ID,
(INPUT_A + INPUT_B + ... + INPUT_Z) AS INPUT_SUM,
(OUTPUT_A + OUTPUT_B + ... + OUTPUT_Z) AS OUTPUT_SUM
FROM DB_NAME.S_NAME.T_NAME
It gives the correct answer, but writing it out seems error-prone: it's easy to omit a column or put one in the wrong list.
To get lists of the relevant column names I was able to use this:
SELECT COLUMN_NAME
FROM DB_NAME.INFORMATION_SCHEMA.COLUMNS
WHERE LEFT(COLUMN_NAME, 5) = 'INPUT' AND TABLE_NAME = 'T_NAME'
I am unsure how to make use of those lists in improving the original query.
Upvotes: 2
Views: 1219
Reputation: 1804
SET QUERY =(
SELECT 'SELECT ID,SUM( '||
LISTAGG(CASE WHEN LEFT(COLUMN_NAME,5)='INPUT' THEN COLUMN_NAME END,'+' )||') AS INPUT_SUM,SUM(' ||
LISTAGG(CASE WHEN LEFT(COLUMN_NAME,6)='OUTPUT' THEN COLUMN_NAME END,'+' )||') AS OUTPUT_SUM FROM T_NAME GROUP BY ID'
FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'T_NAME');
EXECUTE IMMEDIATE ($QUERY);
FUNCTIONS USED: EXECUTE IMMEDIATE(), LISTAGG()
Making use of the information schema can be done with string aggregation functions, of which there are heaps. Once you've got your SQL string built, simply pass it into EXECUTE IMMEDIATE().
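For the sample table, the statement that gets built and executed would look something like the following (reformatted here for readability — the actual generated string is a single line, and because the LISTAGG calls have no WITHIN GROUP clause, the column order within each sum may vary):

```sql
SELECT ID,
       SUM( INPUT_A+INPUT_B+INPUT_Z) AS INPUT_SUM,
       SUM(OUTPUT_A+OUTPUT_B+OUTPUT_Z) AS OUTPUT_SUM
FROM T_NAME
GROUP BY ID
```

Since ID is unique per row here, the SUM() wrapper and GROUP BY ID leave the per-row sums unchanged.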
Personally I'd go with Lukasz Szozda's answer, as it's single-pass and could handle inline views etc. It's just fun to look at all the options; there are always multiple good answers.
Upvotes: 1
Reputation: 601
I often have to deal with data pivoted into this wide format, made up of hundreds or even thousands of columns. Taking inspiration from R's tidyselect package, I created a bunch of composable JavaScript user-defined functions that work with the OBJECT form of a row.
These were written before Snowflake's support for Scala, Java & Python UDFs. Your mileage may vary, but I found them to be a little faster than the LATERAL FLATTEN approach used to create a dynamic unpivot in my use cases.
Here are two of the functions applicable to your question:
create or replace function obj_key_starts_with(
O1 OBJECT, -- OBJECT to apply this transformation over
s_with VARCHAR -- New KEY name starts with. NULL value uses empty string
)
returns VARIANT
language javascript
as $$
// a NULL argument arrives as undefined inside a JavaScript UDF
const sw = (S_WITH === undefined) ? '' : S_WITH;
return Object.fromEntries(Object.entries(O1).filter(([k, v]) => k.startsWith(sw) ))
$$;
create or replace function obj_sum_across(
O1 OBJECT -- OBJECT to apply this transformation over
)
returns FLOAT
language javascript
as $$
const object_sum = Object.values(O1).reduce((a, b) => a + b, 0);
return object_sum;
$$;
And here's the SQL to generate the desired result, using the same sample data table as Lukasz Szozda's solution.
To sum across the columns:
select
ID,
obj_sum_across(obj_key_starts_with(obj_row, 'INPUT'))::DECIMAL(4,2) input_row_sum,
obj_sum_across(obj_key_starts_with(obj_row, 'OUTPUT'))::DECIMAL(4,2) output_row_sum
from (Select ID, object_construct(*) obj_row from tab) s
;
And if you want to further aggregate the results across multiple rows:
Select
ID,
sum(input_row_sum) INPUT_SUM,
sum(output_row_sum) OUTPUT_SUM
from (Select
ID,
SOURCE,
object_construct(*) obj_row,
obj_sum_across(obj_key_starts_with(obj_row, 'INPUT'))::DECIMAL(4,2) input_row_sum,
obj_sum_across(obj_key_starts_with(obj_row, 'OUTPUT'))::DECIMAL(4,2) output_row_sum
from tab) s
Group by 1;
To answer your original question about formulating SQL strings by querying the information_schema views, the following SQL will get you the SQL you were looking for:
SELECT
'Select \n '||
'ID, \n '||
LISTAGG(CASE WHEN LEFT(COLUMN_NAME, 5) = 'INPUT' THEN COLUMN_NAME END,' + ') WITHIN GROUP (ORDER BY COLUMN_NAME) ||',\n '||
LISTAGG(CASE WHEN LEFT(COLUMN_NAME, 6) = 'OUTPUT' THEN COLUMN_NAME END,' + ') WITHIN GROUP (ORDER BY COLUMN_NAME) ||'\n'||
'FROM SIMON.COLUMN_SUM_ACROSS.TAB;' SQL_STRING
FROM SIMON.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TAB' ;
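Assuming the sample table only contains the A, B and Z columns, the SQL_STRING this produces would look roughly like this (note the generated select-list items carry no AS aliases, and the \n escapes become real line breaks):

```sql
Select 
 ID, 
 INPUT_A + INPUT_B + INPUT_Z,
 OUTPUT_A + OUTPUT_B + OUTPUT_Z
FROM SIMON.COLUMN_SUM_ACROSS.TAB;
```

You can then run that string directly, or pass it to EXECUTE IMMEDIATE as in the other answer.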
Upvotes: 1
Reputation: 175586
Proper data modeling should be the preferred way to solve this.
Fortunately, SQL is expressive enough to handle such a scenario with a "dynamic" UNPIVOT. Please be aware that the performance could be worse than explicitly stating the column list (INPUT_A + INPUT_B + ... + INPUT_Z).
Sample data:
CREATE OR REPLACE TABLE tab(ID TEXT, SOURCE TEXT,
INPUT_A DECIMAL(10,2), OUTPUT_A DECIMAL(10,2),
INPUT_B DECIMAL(10,2), OUTPUT_B DECIMAL(10,2),
INPUT_Z DECIMAL(10,2), OUTPUT_Z DECIMAL(10,2),
MOD_DATE TEXT)
AS
SELECT 'ABC-123', 999, 12.1, 11.9, 40.7, 41.0, 1.1, 0.1, '2023-01-19'
UNION SELECT 'DEF-456', 111, 11.1, 12.9, 41.7, 41.0, 1.2, 2.1, '2023-01-19';
Query:
SELECT sub.ID, sub.SOURCE,
SUM(CASE WHEN s.KEY LIKE 'INPUT%' THEN VALUE::DECIMAL(10,2) END) AS INPUT_SUM,
SUM(CASE WHEN s.KEY LIKE 'OUTPUT%' THEN VALUE::DECIMAL(10,2) END) AS OUTPUT_SUM
FROM (SELECT *, OBJECT_CONSTRUCT_KEEP_NULL(*) AS json FROM tab) AS sub
,TABLE(FLATTEN(INPUT=> sub.json)) AS s
GROUP BY sub.ID, sub.SOURCE;
Output:
ID | SOURCE | INPUT_SUM | OUTPUT_SUM |
---|---|---|---|
ABC-123 | 999 | 53.90 | 53.00 |
DEF-456 | 111 | 54.00 | 56.00 |
Upvotes: 2
Reputation: 94859
The long way is the correct way. In your data model you have decided to have 26 different columns for input and output each, so live with that decision.
You could have made a separate table instead:
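A sketch of what that might look like, with column names taken from the aggregation query below (the types and the choice of which columns live in which table are assumptions, since the original table definition wasn't shown):

```sql
CREATE TABLE main_table (
    id       VARCHAR PRIMARY KEY,
    mod_date DATE
);

-- one row per (id, source, measurement) instead of 26 column pairs
CREATE TABLE child_table (
    id_main_table VARCHAR REFERENCES main_table (id),
    source        NUMBER,
    input         DECIMAL(10,2),
    output        DECIMAL(10,2)
);
```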
This would have allowed aggregation:
select
m.id,
c.source,
sum(c.input) as input_sum,
sum(c.output) as output_sum
from main_table m
join child_table c on c.id_main_table = m.id
group by m.id, c.source
order by m.id, c.source;
Upvotes: 2