Reputation: 73
My SQL currently looks like this.
SELECT t1.field1,
t1.field2,
t1.field3,
t1.field4,
t1.field5,
t1.field6,
t1.field7,
t1.field8,
t2.field1,
t2.field2,
t2.field3,
t2.field4,
t2.field5,
t2.field6,
t2.field7,
t2.field8,
t2.field9,
t3.field1,
t4.field1,
t5.field1,
SUM(t6.field1),
MIN(t6.THEDATE)
FROM table1 t1
LEFT JOIN table2 t2
ON t1.field2 = t2.sameFieldName
LEFT JOIN table3 t3
ON t2.field9 = t3.sameFieldName
LEFT JOIN table4 t4
ON t1.field2 = t4.sameFieldName
AND t2.field1 = t4.sameFieldName
LEFT JOIN table5 t5
ON t4.field1 = t5.sameFieldName
LEFT JOIN table6 t6
ON t4.field1 = t6.sameFieldName
AND t4.colName1 = t6.sameFieldName
WHERE t6.THEDATE BETWEEN SYSDATE - 70 AND SYSDATE - 50
AND t1.field2 = 'SUBMIT'
AND t1.field3 LIKE 'H%'
GROUP BY t1.field1,
t1.field2,
t1.field3,
t1.field4,
t1.field5,
t1.field6,
t1.field7,
t1.field8,
t2.field1,
t2.field2,
t2.field3,
t2.field4,
t2.field5,
t2.field6,
t2.field7,
t2.field8,
t2.field9,
t3.field1,
t4.field1,
t5.field1;
The problem is that I need to filter on the minimum date. Written this way, the query "shows" the min date, but it filters records on each row's own "theDate" value rather than on the minimum. I know you cannot use aggregate functions in the WHERE clause because WHERE only operates on a single record. So how can I get something that would work like this?
SELECT *, sum(something), min(theDate)
FROM Table
WHERE min(theDate) BETWEEN SYSDATE - 70 AND SYSDATE - 50
GROUP BY <<<ALL GROUP COLUMNS>>>
Upvotes: 0
Views: 12329
Reputation: 1456
The main performance drain may be unnecessary columns in GROUP BY. This can happen if your Table refers to a denormalized table:
EMP (EMP_ID*, DEPT_ID, DEPT_NAME, SAL, THEDATE)
or if Table refers to a join, e.g.
EMP(EMP_ID*, DEPT_ID, SAL, THEDATE)
DEPT(DEPT_ID*, DEPT_NAME)
"Table" == EMP JOIN DEPT USING (DEPT_ID)
In either case, the query:
SELECT DEPT_ID, DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM EMP
GROUP BY DEPT_ID, DEPT_NAME
HAVING MIN(THEDATE) >= SYSDATE-70 AND MIN(THEDATE) < SYSDATE-50;
will experience all the overhead of grouping the DEPT_NAME, even though DEPT_NAME will always have the same value for a given DEPT_ID. In other words, DEPT_ID is a candidate key of the selected columns. If the "*" in your SELECT *, SUM(whatever) has one or more candidate keys (commonly all of the "_ID" columns) that determine unique values for all other columns, then it will be far more efficient to do something like this:
SELECT DEPT_ID, MAX(DEPT_NAME) DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM EMP JOIN DEPT USING (DEPT_ID)
GROUP BY DEPT_ID
HAVING MIN(THEDATE) >= SYSDATE-70 AND MIN(THEDATE) < SYSDATE-50;
The performance difference can be especially dramatic if your dependent columns are long strings.
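As a rough check that the two groupings return the same rows, here is a minimal sketch that uses Python's sqlite3 module as a stand-in for Oracle. The table layout and sample values are made up for illustration, and it only demonstrates equivalent results, not the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical denormalized EMP table: DEPT_NAME is fully determined by DEPT_ID.
conn.execute("CREATE TABLE emp (dept_id INTEGER, dept_name TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, "SALES", 100), (1, "SALES", 200), (2, "OPS", 300)],
)

# Grouping by the key and the dependent column.
by_both = conn.execute(
    "SELECT dept_id, dept_name, SUM(sal) FROM emp "
    "GROUP BY dept_id, dept_name ORDER BY dept_id"
).fetchall()

# Grouping by the candidate key only; MAX() just picks the single name per group.
by_key = conn.execute(
    "SELECT dept_id, MAX(dept_name) AS dept_name, SUM(sal) FROM emp "
    "GROUP BY dept_id ORDER BY dept_id"
).fetchall()

print(by_both == by_key)
```

This equivalence only holds when DEPT_ID really does determine DEPT_NAME. If the data can contain two names for one ID, the key-only version silently merges those groups.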
Gordon Linoff's answer makes a good point (though I think his implementation can be further optimized, see below): in some cases it can make sense to "pre-filter". A few rule-of-thumb indications that this might be faster:
1) majority (80% +) of rows have THEDATE older than 70 days
2) a single index on Table which includes most/all GROUP BY columns and, preferably, THEDATE
3) either a separate index on THEDATE or THEDATE is the first column of the index noted in #2
4) alternatively for #3 - Table is partitioned by THEDATE (and index in #2 being a local index would be better still)
Basic "pre-filter logic": sum over all Table rows which
1) don't belong to a grouping with any rows that are "too old"
2) do belong to a grouping with at least 1 row that's "old enough but not too old"
3) are not themselves "too old"
SELECT DEPT_ID, DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM EMP E1
WHERE NOT EXISTS
(SELECT 1 FROM EMP E2
WHERE E2.DEPT_ID = E1.DEPT_ID
AND E2.DEPT_NAME=E1.DEPT_NAME
AND E2.THEDATE < SYSDATE - 70)
AND EXISTS
(SELECT 1 FROM EMP E2
WHERE E2.DEPT_ID = E1.DEPT_ID
AND E2.DEPT_NAME=E1.DEPT_NAME
AND E2.THEDATE BETWEEN SYSDATE-70 AND SYSDATE - 50)
AND E1.THEDATE >= SYSDATE -70
GROUP BY DEPT_ID, DEPT_NAME;
Final note: If both the candidate-key-grouping and pre-filter optimizations seem applicable, they can be applied in tandem:
SELECT DEPT_ID, MAX(DEPT_NAME) DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM EMP E1
WHERE NOT EXISTS
(SELECT 1 FROM EMP E2
WHERE E2.DEPT_ID = E1.DEPT_ID
AND E2.THEDATE < SYSDATE - 70)
AND EXISTS
(SELECT 1 FROM EMP E2
WHERE E2.DEPT_ID = E1.DEPT_ID
AND E2.THEDATE BETWEEN SYSDATE-70 AND SYSDATE - 50)
AND E1.THEDATE >= SYSDATE -70
GROUP BY DEPT_ID;
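To sanity-check that the pre-filter form selects the same groups as the HAVING form, here is a small sketch run against SQLite through Python's sqlite3 module, not Oracle. The fixed TODAY date, the days_ago helper, and the sample rows are all hypothetical. ISO date strings stand in for DATE values, with bound parameters in place of SYSDATE - 70 and SYSDATE - 50.

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2024, 6, 30)  # hypothetical fixed stand-in for SYSDATE

def days_ago(n):
    """ISO date string n days before TODAY (ISO strings sort chronologically)."""
    return (TODAY - timedelta(days=n)).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (dept_id INTEGER, dept_name TEXT, sal INTEGER, thedate TEXT)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        (1, "SALES", 100, days_ago(60)),  # min date inside the window
        (1, "SALES", 200, days_ago(10)),
        (2, "OPS", 300, days_ago(90)),    # group contains a "too old" row
        (2, "OPS", 400, days_ago(60)),
        (3, "HR", 500, days_ago(20)),     # group has no row old enough
    ],
)
bounds = {"lo": days_ago(70), "hi": days_ago(50)}

# Filter after aggregating.
having_rows = conn.execute(
    """
    SELECT dept_id, MAX(dept_name), SUM(sal), MIN(thedate)
    FROM emp
    GROUP BY dept_id
    HAVING MIN(thedate) BETWEEN :lo AND :hi
    ORDER BY dept_id
    """,
    bounds,
).fetchall()

# Filter rows before aggregating, using the EXISTS / NOT EXISTS conditions.
prefilter_rows = conn.execute(
    """
    SELECT dept_id, MAX(dept_name), SUM(sal), MIN(thedate)
    FROM emp e1
    WHERE NOT EXISTS (SELECT 1 FROM emp e2
                      WHERE e2.dept_id = e1.dept_id AND e2.thedate < :lo)
      AND EXISTS (SELECT 1 FROM emp e2
                  WHERE e2.dept_id = e1.dept_id
                    AND e2.thedate BETWEEN :lo AND :hi)
    GROUP BY dept_id
    ORDER BY dept_id
    """,
    bounds,
).fetchall()

print(having_rows == prefilter_rows)
```

Only department 1 survives in both versions. Which form is faster on real data depends on the indexes and data distribution described above, which this toy example does not measure.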
Beyond that, there's probably not much more you can do to boost performance in the query itself (though parallel execution may be an option). To get the correct results any faster, you'd have to look at structural changes (materialized views, indexes, partitioning options, etc.) to support the query.
Upvotes: 0
Reputation: 1269513
Depending on the size of your data, it might be fastest to filter the records before doing the aggregation. The appropriate filter is:
SELECT *, sum(something), min(theDate)
FROM Table t
WHERE NOT EXISTS (SELECT 1 FROM table t2 WHERE . . . AND t2.thedate < SYSDATE - 70) AND
EXISTS (SELECT 1 FROM table t2 WHERE . . . AND t2.thedate <= sysdate - 50)
GROUP BY *
The . . . are equality conditions based on the groups annotated using *.
Upvotes: 1
Reputation: 132570
You could use analytic functions:
SELECT * FROM
(
SELECT *, sum(something) over(), row_number() over (order by theDate) as rn
FROM Table
WHERE theDate BETWEEN SYSDATE - 70 AND SYSDATE - 50
)
WHERE rn = 1;
You might use rank() instead of row_number(), but that could return more than 1 row if the minimum date value appears in more than 1 row.
Upvotes: 0
Reputation: 231661
The syntax you posted isn't valid: you can't group by * and you'd need to have an alias on the select * since you're selecting other columns. Assuming both of those are just artifacts of putting together a simplified example, you just need to use the having clause:
SELECT a.*, sum(something), min(theDate)
FROM table_name a
GROUP BY <<list of columns in a>>
HAVING min(theDate) BETWEEN sysdate - 70 AND sysdate - 50
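
To illustrate how this differs from filtering in WHERE, here is a minimal sketch run against SQLite through Python's sqlite3 module rather than Oracle. The emp table, the fixed TODAY date, and the days_ago helper are invented for the example, and ISO date strings with bound parameters stand in for sysdate - 70 and sysdate - 50.

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2024, 6, 30)  # hypothetical fixed stand-in for sysdate

def days_ago(n):
    """ISO date string n days before TODAY (ISO strings sort chronologically)."""
    return (TODAY - timedelta(days=n)).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept_id INTEGER, sal INTEGER, thedate TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        (1, 100, days_ago(60)),  # dept 1: earliest row is 60 days old
        (1, 200, days_ago(10)),
        (2, 300, days_ago(90)),  # dept 2: earliest row is 90 days old
        (2, 400, days_ago(60)),
        (3, 500, days_ago(20)),  # dept 3: earliest row is 20 days old
    ],
)
window = (days_ago(70), days_ago(50))

# WHERE filters individual rows before grouping, so sums and minimums
# are computed over the surviving rows only.
where_rows = conn.execute(
    "SELECT dept_id, SUM(sal), MIN(thedate) FROM emp "
    "WHERE thedate BETWEEN ? AND ? GROUP BY dept_id ORDER BY dept_id",
    window,
).fetchall()

# HAVING filters whole groups after aggregation, based on the group's true minimum.
having_rows = conn.execute(
    "SELECT dept_id, SUM(sal), MIN(thedate) FROM emp "
    "GROUP BY dept_id HAVING MIN(thedate) BETWEEN ? AND ? ORDER BY dept_id",
    window,
).fetchall()

print("WHERE: ", where_rows)
print("HAVING:", having_rows)
```

With this data the WHERE version keeps partial sums for departments 1 and 2, while the HAVING version returns only department 1 with its full total.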
Upvotes: 2