yankee
yankee

Reputation: 40830

Can I express in SQL that an aggregation has only a single value?

Suppose I have a query like this:

select *
from (values
--   A   B   C
    (1, 11, 92),
    (1, 11, 117),
    (2, 11, 97),
    (3, 22, 65)
) x(a,b,c)

And then assume that we know that for each value in column A the value of column B is guaranteed to be the same. So it is impossible that the rows (5, 6, 7) and (5, 7, 7) both exist in the source data because in the example there are two different Values for B (6,7) which both reference A=5.

Now I would like to do some Aggregation like this on the data:

select a, b, max(c)
from (values
    (1, 11, 92),
    (1, 11, 117),
    (2, 11, 97),
    (3, 22, 65)
) x(a,b,c)
group by a

This will of course generate the error:

ERROR: column "x.b" must appear in the GROUP BY clause or be used in an aggregate function

As a workaround I can of course just write

select a, max(b), max(c) ...

Or I could add column b to the group by-clause. Both will work just fine. However it has two drawbacks:

  1. It is confusing to read. When reviewing such a query I automatically wonder either why I am selecting only the max-Value for b or I wonder why I am grouping by this value. I need to extensively make sure that the assumption that the added max call or the added column in the group by clause is actually "do nothing" in a code review.
  2. In case I am mistaken with that assumption this could cause severe bugs which are very hard to detect.

Is there a more expressive way to write that there will be only a single possible value in this group? E.g. an aggregation function that returns a single value, but throws an error if multiple values are present.

(In case you are wondering about the use case: I need to import some very large CSV files which are formatted like this. In the first step I create a temporary table and just load all values into that table. Then in the second step I do some aggregation and insertion queries directly from the temp table. The CSV files I need to import do have these characteristics that I am assuming.)

Upvotes: 0

Views: 170

Answers (2)

James K. Lowden
James K. Lowden

Reputation: 7837

There are a few things you can do to make sure your results are valid, but you'll have to check in a couple of different ways.

Let's start with a query that produces correct rows, those meeting your assumptions.

with T (a, b, c) as (
select * from
   ( values
    (1, 11, 92),
    (1, 11, 117),
    (2, 11, 97),
    (3, 22, 65) ) x(a,b,c)
) 
select count(*) as N, A.a, min(B.b) as b, A.c
from (select a, max(c) as c from T group by a) as A
join (select distinct a, b from T) as B
on A.a = B.a
group by A.a, A.c
having count(*) = 1

(That's roughly right. I didn't run it, and your DBMS may have slightly different CTE syntax.)

That gets you the max(c) for rows for which a:b have 1:1 cardinality. To look for rows that don't meet that test -- an error -- run the same query with count != 1. Or remove the HAVING, and test N in your program:

select count(*) as N, a, min(b) as b, max(c) as c
from ( values
    (1, 11, 92),
    (1, 11, 117),
    (2, 11, 97),
    (3, 22, 65) ) x(a,b,c)
group by a

If you are fairly certain about the a:b cardinality, declare a UNIQUE constraint in your table, and thereby ensure min(b) === b.

It is confusing to read

Perhaps. Someplace you have to say, a:b should be 1:1, and here is where we reject rows that don't meet that test.

Upvotes: 0

Clodoaldo Neto
Clodoaldo Neto

Reputation: 125434

distinct on:

select distinct on (a) *
from t
order by a, c desc

Upvotes: 1

Related Questions