Reputation: 35
In a query, I did a GROUP BY on a date field that produced these summary results. The Query was like this:
SELECT
(CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END) AS BILLING_SOURCE, std.created_on, COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM onebiller.t_std_in_detail_his std
INNER JOIN onebiller.t_std_in_header h
ON h.job_id = std.job_id
WHERE std.invoice_number IS NOT NULL
GROUP BY (CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END), std.created_on
ORDER BY std.created_on ASC
Note these results for 3 datetimes on 19-Mar-2021 from the created_on field.
I then used TRUNC(created_on) to try to group all the records from a single day. This was the updated query:
SELECT
(CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END) AS BILLING_SOURCE, TRUNC(std.created_on), COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM onebiller.t_std_in_detail_his std
INNER JOIN onebiller.t_std_in_header h
ON h.job_id = std.job_id
WHERE std.invoice_number IS NOT NULL
GROUP BY (CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END), TRUNC(std.created_on)
ORDER BY TRUNC(std.created_on) ASC
I was expecting a result that would sum the 3 highlighted rows from the first query (2+165+164) instead I received a count of 166 for 19-Mar-2021. Why didn't I get the sum of (2+165+164)?
Upvotes: 0
Views: 56
Reputation: 191245
Because you have the same invoice number repeated at different times on the same day. As a simpler example, say you have:
CREATED_ON | INVOICE_NUMBER |
---|---|
2021-03-19 09:00:00 | 1 |
2021-03-19 09:00:00 | 2 |
2021-03-19 12:00:00 | 2 |
2021-03-19 15:00:00 | 1 |
2021-03-19 15:00:00 | 2 |
2021-03-19 15:00:00 | 3 |
That shows 6 rows, but only three distinct invoice numbers - 1, 2 and 3.
A simplified version of your first query gives:
SELECT std.created_on,
COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM std
GROUP BY std.created_on
ORDER BY std.created_on ASC
CREATED_ON | COUNT_OF_INVOICES |
---|---|
2021-03-19 09:00:00 | 2 |
2021-03-19 12:00:00 | 1 |
2021-03-19 15:00:00 | 3 |
The sum of those counts currently matches the number of rows in the table, 6. (I haven't included any duplicates at the same time, so the distinct
isn't doing anything at the moment, again to keep it simple.) The row for 09:00 counts invoice numbers 1 and 2; the row for 12:00 only counts 2; and the row for 15:00 counts 1, 2 and 3. The counts in all three rows include a count for invoice number 2, the first and third include a count for invoice number 1 - so the same invoice numbers are being counted multiple times.
SELECT TRUNC(std.created_on) as created_on,
COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM std
GROUP BY TRUNC(std.created_on)
ORDER BY TRUNC(std.created_on) ASC
CREATED_ON | COUNT_OF_INVOICES |
---|---|
2021-03-19 00:00:00 | 3 |
Now the single result is three, because that's how many distinct invoice numbers there are that day - it's now counting 1, 2 and 3 once each, not 2x2, 3x2 and 1x3.
If you didn't have the distinct
then the second query would also get 6, because it wouldn't be eliminating the duplicates seen in the first query.
Upvotes: 2