Chase

Reputation: 23

Sum values in a row with similar column names in SQL

I have an SQL table that looks like this:

ID        SOURCE    INPUT_A   OUTPUT_A  INPUT_B   OUTPUT_B  ...  INPUT_Z   OUTPUT_Z  MOD_DATE
(String)  (Number)  (Number)  (Number)  (Number)  (Number)  ...  (Number)  (Number)  (Date)
ABC-123   999       12.1      11.9      40.7      41.0      ...  1.1       0.1       2023-01-19
DEF-456   111       11.1      12.9      41.7      41.0      ...  1.2       2.1       2023-01-19

I would like to get a query result that looks like this:

ID        INPUT_SUM  OUTPUT_SUM
ABC-123   80.7       79.7
DEF-456   80.8       82.8

The long way to do it seems to be this:

SELECT ID,
       (INPUT_A + INPUT_B + ... + INPUT_Z) AS INPUT_SUM,
       (OUTPUT_A + OUTPUT_B + ... + OUTPUT_Z) AS OUTPUT_SUM
FROM DB_NAME.S_NAME.T_NAME

It gives the correct answer, but it seems error-prone: it is easy to miss a column or put one in the wrong sum.

To get lists of the relevant column names I was able to use this:

SELECT COLUMN_NAME
  FROM DB_NAME.INFORMATION_SCHEMA.COLUMNS
 WHERE LEFT(COLUMN_NAME, 5) = 'INPUT' AND TABLE_NAME = 'T_NAME'

I am unsure how to make use of those lists in improving the original query.

Upvotes: 2

Views: 1219

Answers (4)

Adrian White

Reputation: 1804


-- Build the query text from the column names in the information schema
SET QUERY = (
  SELECT 'SELECT ID,SUM('||
         LISTAGG(CASE WHEN LEFT(COLUMN_NAME, 5) = 'INPUT'  THEN COLUMN_NAME END, '+')||') AS INPUT_SUM,SUM('||
         LISTAGG(CASE WHEN LEFT(COLUMN_NAME, 6) = 'OUTPUT' THEN COLUMN_NAME END, '+')||') AS OUTPUT_SUM FROM T_NAME GROUP BY ID'
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_NAME = 'T_NAME');

-- Run the generated statement
EXECUTE IMMEDIATE ($QUERY);

Functions used: EXECUTE IMMEDIATE(), LISTAGG()

Making use of the information schema can be done with string aggregation functions, of which there are plenty. Once you've got the SQL string you want, simply pass it into EXECUTE IMMEDIATE().
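For reference, assuming the table only had the A, B and Z input/output columns, the statement built above would come out roughly like this before being executed:

-- Approximate statement generated into $QUERY (A, B and Z columns assumed)
SELECT ID,
       SUM(INPUT_A+INPUT_B+INPUT_Z) AS INPUT_SUM,
       SUM(OUTPUT_A+OUTPUT_B+OUTPUT_Z) AS OUTPUT_SUM
FROM T_NAME
GROUP BY ID

Since the LISTAGG calls have no WITHIN GROUP clause, the column order inside each sum is not guaranteed, but the result is the same either way.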

Personally I'd go with Lukasz Szozda's answer, as it's a single pass and could handle inline views etc. It's just fun to look at all the options; there are always multiple good answers.

Upvotes: 1

Fieldy

Reputation: 601

I often have to deal with data pivoted into this wide format, made up of hundreds or even thousands of columns. Taking inspiration from R's tidyselect package, I created a bunch of composable JavaScript user-defined functions that work with the object form of a row.

These were written before Snowflake's support for Scala, Java and Python UDFs. Your mileage may vary, but in my use cases I found them to be a little faster than the lateral FLATTEN approach used to create a dynamic unpivot.

Here are two of the functions applicable to your question:

create or replace function obj_key_starts_with(
   O1 OBJECT,      -- OBJECT to apply this transformation over
   s_with VARCHAR  -- keep only keys starting with this prefix; NULL is treated as an empty string
    )
    returns VARIANT
    language javascript
    as $$
     // a SQL NULL argument arrives as undefined inside a JavaScript UDF
     let sw;
      if (S_WITH === undefined) {
        sw = '';
      } else {
        sw = S_WITH;
      }
    // keep only the entries whose key starts with the prefix
    return Object.fromEntries(Object.entries(O1).filter(([k, v]) => k.startsWith(sw)))
    $$;

create or replace function obj_sum_across(
  O1 OBJECT      -- OBJECT to apply this transformation over
)
returns FLOAT
language javascript
as $$
// add up all of the object's values
const object_sum = Object.values(O1).reduce((a, b) => a + b, 0);
return object_sum;
$$;
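As a quick sanity check with hypothetical literal values (not the question's table), the two functions compose like this:

-- Hypothetical literal object; only the INPUT_* keys are kept and summed (returns 4)
SELECT obj_sum_across(
         obj_key_starts_with(
           OBJECT_CONSTRUCT('INPUT_A', 1.5, 'INPUT_B', 2.5, 'OUTPUT_A', 9.9),
           'INPUT')) AS input_sum;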

And here's the SQL to generate the desired result, using the same sample data table from Lukasz Szozda's solution.

To sum across the columns

select
   ID,
   obj_sum_across(obj_key_starts_with(obj_row, 'INPUT'))::DECIMAL(4,2) input_row_sum, 
   obj_sum_across(obj_key_starts_with(obj_row, 'OUTPUT'))::DECIMAL(4,2) output_row_sum
from (Select ID, object_construct(*) obj_row from tab) s
;

And if you want to further aggregate the results across multiple rows:

Select
  ID,
  sum(input_row_sum) INPUT_SUM, 
  sum(output_row_sum) OUTPUT_SUM
from (Select
            ID,
            SOURCE,
            object_construct(*) obj_row,
            obj_sum_across(obj_key_starts_with(obj_row, 'INPUT'))::DECIMAL(4,2) input_row_sum, 
            obj_sum_across(obj_key_starts_with(obj_row, 'OUTPUT'))::DECIMAL(4,2) output_row_sum
      from tab) s
Group by 1;

To answer your original question about formulating SQL strings by querying the INFORMATION_SCHEMA tables, the following SQL will generate the statement you were looking for:

SELECT 
     'Select \n   '||
     'ID, \n   '||
     LISTAGG(CASE WHEN LEFT(COLUMN_NAME, 5) = 'INPUT' THEN COLUMN_NAME END,' + ') WITHIN GROUP (ORDER BY COLUMN_NAME) ||',\n   '||
     LISTAGG(CASE WHEN LEFT(COLUMN_NAME, 6) = 'OUTPUT' THEN COLUMN_NAME END,' + ')  WITHIN GROUP (ORDER BY COLUMN_NAME) ||'\n'||
     'FROM SIMON.COLUMN_SUM_ACROSS.TAB;' SQL_STRING
FROM SIMON.INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_NAME = 'TAB' ;
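Run against the three-column sample table, that query produces a SQL_STRING along these lines (which you could run as-is or feed to EXECUTE IMMEDIATE):

-- Approximate contents of SQL_STRING for the sample table (INPUT_A/B/Z, OUTPUT_A/B/Z)
Select 
   ID, 
   INPUT_A + INPUT_B + INPUT_Z,
   OUTPUT_A + OUTPUT_B + OUTPUT_Z
FROM SIMON.COLUMN_SUM_ACROSS.TAB;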

Upvotes: 1

Lukasz Szozda

Reputation: 175586

Proper data modeling should be the preferred way to solve this.

Fortunately, SQL is expressive enough to handle such a scenario with a "dynamic" UNPIVOT. Please be aware that the performance could be worse than explicitly listing the columns (INPUT_A + INPUT_B + ... + INPUT_Z).

Sample data:

CREATE OR REPLACE TABLE tab(ID TEXT, SOURCE TEXT,
                            INPUT_A   DECIMAL(10,2), OUTPUT_A  DECIMAL(10,2),
                            INPUT_B   DECIMAL(10,2), OUTPUT_B  DECIMAL(10,2),
                            INPUT_Z   DECIMAL(10,2), OUTPUT_Z  DECIMAL(10,2),
                            MOD_DATE  TEXT)
AS
      SELECT 'ABC-123', 999, 12.1, 11.9, 40.7, 41.0, 1.1, 0.1, '2023-01-19'
UNION SELECT 'DEF-456', 111, 11.1, 12.9, 41.7, 41.0, 1.2, 2.1, '2023-01-19';

Query:

SELECT sub.ID, sub.SOURCE,
   SUM(CASE WHEN s.KEY LIKE 'INPUT%'  THEN VALUE::DECIMAL(10,2) END) AS INPUT_SUM,
   SUM(CASE WHEN s.KEY LIKE 'OUTPUT%' THEN VALUE::DECIMAL(10,2) END) AS OUTPUT_SUM
FROM (SELECT *, OBJECT_CONSTRUCT_KEEP_NULL(*) AS json FROM tab) AS sub
,TABLE(FLATTEN(INPUT=> sub.json)) AS s
GROUP BY sub.ID, sub.SOURCE;
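The trick is that OBJECT_CONSTRUCT_KEEP_NULL(*) packs each row into one object keyed by column name, which FLATTEN then explodes into one row per key/value pair. To peek at that intermediate object for a single row (illustrative; the exact rendering may differ):

-- Inspect the object built for one sample row
SELECT OBJECT_CONSTRUCT_KEEP_NULL(*) AS json FROM tab WHERE ID = 'ABC-123';
-- => roughly { "ID": "ABC-123", "SOURCE": "999", "INPUT_A": 12.1, "OUTPUT_A": 11.9,
--              "INPUT_B": 40.7, "OUTPUT_B": 41.0, "INPUT_Z": 1.1, "OUTPUT_Z": 0.1,
--              "MOD_DATE": "2023-01-19" }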

Output:

[screenshot of the query result showing ID, SOURCE, INPUT_SUM and OUTPUT_SUM for both sample rows]

Upvotes: 2

Thorsten Kettner

Reputation: 94859

The long way is the correct way. In your data model you have decided to have 26 different columns for input and output each, so live with that decision.

You could have made a separate table instead:

  • main_table (id, mod_date)
  • child_table(id, id_main_table, source, input, output)
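In Snowflake DDL that might look roughly like this (column names follow the bullets above; the types are assumptions, not taken from the question):

-- Hypothetical DDL for the normalized model sketched above (types assumed)
CREATE TABLE main_table (
    id        TEXT,
    mod_date  DATE
);

CREATE TABLE child_table (
    id             TEXT,
    id_main_table  TEXT,           -- points back to main_table.id
    source         NUMBER,
    input          DECIMAL(10,2),
    output         DECIMAL(10,2)
);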

This would have allowed aggregation:

select
  m.id,
  c.source,
  sum(c.input) as input_sum,
  sum(c.output) as output_sum
from main_table m
join child_table c on c.id_main_table = m.id
group by m.id, c.source
order by m.id, c.source;

Upvotes: 2
