Zionmoose

Reputation: 73

Oracle: select records based on a min date where the min date falls between two dates

My SQL currently looks like this.

SELECT t1.field1,
       t1.field2,
       t1.field3,
       t1.field4,
       t1.field5,
       t1.field6,
       t1.field7,
       t1.field8,
       t2.field1,
       t2.field2,
       t2.field3,
       t2.field4,
       t2.field5,
       t2.field6,
       t2.field7,
       t2.field8,
       t2.field9,
       t3.field1,
       t4.field1,
       t5.field1,
       SUM(t6.field1),
       MIN(t6.THEDATE) 

  FROM table1 t1
    LEFT JOIN table2 t2
      ON t1.field2 = t2.sameFieldName
      LEFT JOIN table3 t3
        ON t2.field9 = t3.sameFieldName
        LEFT JOIN table4 t4
          ON t1.field2 = t4.sameFieldName
          AND t2.field1 = t4.sameFieldName
          LEFT JOIN table5 t5
            ON t4.field1 = t5.sameFieldName
            LEFT JOIN table6 t6
              ON t4.field1 = t6.sameFieldName
              AND t4.colName1 = t6.sameFieldName

  WHERE t6.THEDATE BETWEEN SYSDATE - 70 AND SYSDATE - 50
    AND t1.field2 = 'SUBMIT'
    AND t1.field3 LIKE 'H%'

  GROUP BY t1.field1,
           t1.field2,
           t1.field3,
           t1.field4,
           t1.field5,
           t1.field6,
           t1.field7,
           t1.field8,
           t2.field1,
           t2.field2,
           t2.field3,
           t2.field4,
           t2.field5,
           t2.field6,
           t2.field7,
           t2.field8,
           t2.field9,
           t3.field1,
           t4.field1,
           t5.field1;

The problem is that I need to filter on the min date. Written this way, the query "shows" the min date, but the WHERE clause filters on each row's own THEDATE value rather than on the group's minimum. I know you cannot use aggregate functions in the WHERE clause, because WHERE operates on one row at a time. So how can I get something that works like this?

SELECT *, sum(something), min(theDate)
FROM Table
WHERE min(theDate) BETWEEN SYSDATE - 70 AND SYSDATE - 50
GROUP BY <<<ALL GROUP COLUMNS>>>

Upvotes: 0

Views: 12329

Answers (4)

KevinKirkpatrick

Reputation: 1456

The main performance drain may be unnecessary columns in GROUP BY. This can happen if your Table refers to a denormalized table:

EMP (EMP_ID*, DEPT_ID, DEPT_NAME, SAL, THEDATE)

or if Table refers to a join, e.g.

EMP(EMP_ID*, DEPT_ID, SAL, THEDATE)
DEPT(DEPT_ID*, DEPT_NAME)

"Table" == EMP JOIN DEPT USING (DEPT_ID)

In either case, the query:

SELECT DEPT_ID, DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM   EMP
GROUP BY DEPT_ID, DEPT_NAME
HAVING MIN(THEDATE) >= SYSDATE-70 AND MIN(THEDATE) < SYSDATE-50;

will incur all the overhead of grouping on DEPT_NAME, even though DEPT_NAME always has the same value for a given DEPT_ID. In other words, DEPT_ID is a candidate key of the selected columns. If the "*" in your SELECT *, SUM(whatever) contains one or more candidate keys (commonly the "_ID" columns) that determine unique values for all other columns, then it can be far more efficient to do something like this:

SELECT DEPT_ID, MAX(DEPT_NAME) DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM   EMP
GROUP BY DEPT_ID
HAVING MIN(THEDATE) >= SYSDATE-70 AND MIN(THEDATE) < SYSDATE-50;

The performance difference can be especially dramatic if your dependent columns are long strings.
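To check that the two groupings return the same rows, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Oracle. The table contents are made up for illustration, and it only shows that the results match; it says nothing about the performance difference, which you would have to measure on your own data.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, dept_id INTEGER, "
    "dept_name TEXT, sal INTEGER)"
)
# Hypothetical denormalized rows: dept_name is fully determined by dept_id.
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        (1, 10, "SALES", 100),
        (2, 10, "SALES", 200),
        (3, 20, "RESEARCH", 300),
    ],
)

# Grouping on both the key and the dependent column.
wide = conn.execute(
    "SELECT dept_id, dept_name, SUM(sal) FROM emp "
    "GROUP BY dept_id, dept_name ORDER BY dept_id"
).fetchall()

# Grouping on the key only; MAX() just picks the single dept_name per group.
narrow = conn.execute(
    "SELECT dept_id, MAX(dept_name) AS dept_name, SUM(sal) FROM emp "
    "GROUP BY dept_id ORDER BY dept_id"
).fetchall()

print(wide == narrow)
```

This equivalence only holds when dept_id really does determine dept_name; if two names can share one id, the MAX() version silently merges them.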

Gordon Linoff's answer makes a good point (though I think his implementation can be further optimized, see below): in some cases it can make sense to "pre-filter". A few rule-of-thumb indications that this might be faster:
1) the majority (80% +) of rows have a THEDATE older than 70 days
2) there is a single index on Table which includes most/all GROUP BY columns and, preferably, THEDATE
3) there is either a separate index on THEDATE, or THEDATE is the first column of the index noted in #2
4) as an alternative to #3, Table is partitioned by THEDATE (and the index in #2 being a local index would be better still)

Basic "pre-filter logic": sum over all Table rows which
1) don't belong to a grouping with any rows that are "too old"
2) do belong to a grouping with at least 1 row that's "old enough but not too old"
3) are not themselves "too old"

SELECT  DEPT_ID, DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM    EMP E1
WHERE   NOT EXISTS 
            (SELECT 1 FROM EMP E2 
             WHERE E2.DEPT_ID = E1.DEPT_ID 
                   AND E2.DEPT_NAME=E1.DEPT_NAME
                   AND E2.THEDATE < SYSDATE - 70)
        AND EXISTS
            (SELECT 1 FROM EMP E2 
             WHERE E2.DEPT_ID = E1.DEPT_ID 
                   AND E2.DEPT_NAME=E1.DEPT_NAME
                   AND E2.THEDATE BETWEEN SYSDATE-70 AND SYSDATE - 50)
       AND E1.THEDATE >= SYSDATE -70
GROUP BY DEPT_ID, DEPT_NAME;

Final note: If both the candidate-key-grouping and pre-filter optimizations seem applicable, they can be applied in tandem:

SELECT  DEPT_ID, MAX(DEPT_NAME) DEPT_NAME, SUM(SAL), MIN(THEDATE)
FROM    EMP E1
WHERE   NOT EXISTS 
            (SELECT 1 FROM EMP E2 
             WHERE E2.DEPT_ID = E1.DEPT_ID 
                   AND E2.THEDATE < SYSDATE - 70)
        AND EXISTS
            (SELECT 1 FROM EMP E2 
             WHERE E2.DEPT_ID = E1.DEPT_ID 
                   AND E2.THEDATE BETWEEN SYSDATE-70 AND SYSDATE - 50)
       AND E1.THEDATE >= SYSDATE -70
GROUP BY DEPT_ID;

Beyond that, there's probably not much more you can do to boost performance within the query itself (though parallel execution may be an option). To get the correct results any faster, you'd have to look at structural changes (materialized views, indexes, partitioning options, etc.) to support the query.

Upvotes: 0

Gordon Linoff

Reputation: 1269513

Depending on the size of your data, it might be fastest to filter the records before doing the aggregation. The appropriate filter is:

SELECT *, sum(something), min(theDate)
FROM Table t
WHERE NOT EXISTS (SELECT 1 FROM table t2 WHERE . . . AND t2.thedate < SYSDATE - 70) AND
      EXISTS (SELECT 1 FROM table t2 WHERE . . . AND t2.thedate <= sysdate - 50)
GROUP BY *

The . . . stands for equality conditions on the grouping columns, which are denoted here by *.
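As a rough check of this pre-filter, the sketch below runs it next to a plain HAVING version and compares the results. It uses Python's built-in sqlite3 module rather than Oracle, a fixed date in place of SYSDATE, and a made-up emp table with a single hypothetical grouping column dept_id, so the . . . becomes t2.dept_id = t.dept_id.

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2024, 3, 31)  # fixed stand-in for SYSDATE (assumption)

def days_ago(n):
    """ISO date string n days before TODAY; ISO strings sort chronologically."""
    return (TODAY - timedelta(days=n)).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept_id INTEGER, sal INTEGER, thedate TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        (10, 100, days_ago(60)), (10, 200, days_ago(10)),  # min date in range
        (20, 300, days_ago(90)), (20, 400, days_ago(60)),  # min date too old
        (30, 500, days_ago(20)), (30, 600, days_ago(5)),   # min date too new
    ],
)
params = {"lo": days_ago(70), "hi": days_ago(50)}

# Pre-filter: drop groups with any too-old row, keep groups with an old-enough row.
prefilter_rows = conn.execute(
    """
    SELECT t.dept_id, SUM(t.sal), MIN(t.thedate)
    FROM emp t
    WHERE NOT EXISTS (SELECT 1 FROM emp t2
                      WHERE t2.dept_id = t.dept_id AND t2.thedate < :lo)
      AND EXISTS (SELECT 1 FROM emp t2
                  WHERE t2.dept_id = t.dept_id AND t2.thedate <= :hi)
    GROUP BY t.dept_id
    ORDER BY t.dept_id
    """,
    params,
).fetchall()

# Reference result: aggregate first, then filter on the group's minimum date.
having_rows = conn.execute(
    """
    SELECT dept_id, SUM(sal), MIN(thedate)
    FROM emp
    GROUP BY dept_id
    HAVING MIN(thedate) BETWEEN :lo AND :hi
    ORDER BY dept_id
    """,
    params,
).fetchall()

print(prefilter_rows == having_rows)
```

Whether the pre-filter is actually faster depends on your data volume and indexes; this only shows that the two forms select the same groups on the sample rows.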

Upvotes: 1

Tony Andrews

Reputation: 132570

You could use analytic functions:

SELECT * FROM
(
SELECT t.*, sum(something) over () as total, row_number() over (order by theDate) as rn
FROM Table t
WHERE theDate BETWEEN SYSDATE - 70 AND SYSDATE - 50
)
WHERE rn = 1;

You might use rank() instead of row_number(), but that could return more than one row if the minimum date value appears in more than one row.

Upvotes: 0

Justin Cave

Reputation: 231661

The syntax you posted isn't valid: you can't group by *, and you'd need an alias on the select * since you're selecting other columns. Assuming both of those are just artifacts of putting together a simplified example, you just need to use the HAVING clause:

SELECT a.*, sum(something), min(theDate)
  FROM table_name a
 GROUP BY <<list of columns in a>>
HAVING min(theDate) BETWEEN sysdate - 70 AND sysdate - 50
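To see the difference between filtering in WHERE and filtering in HAVING, here is a small sketch you can run. It uses Python's built-in sqlite3 module instead of Oracle, a fixed date in place of sysdate, and a made-up emp table, so treat the names and data as assumptions for illustration only.

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2024, 3, 31)  # fixed stand-in for sysdate (assumption)

def days_ago(n):
    """ISO date string n days before TODAY; ISO strings sort chronologically."""
    return (TODAY - timedelta(days=n)).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept_id INTEGER, sal INTEGER, thedate TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        (10, 100, days_ago(60)), (10, 200, days_ago(10)),  # min date in range
        (20, 300, days_ago(90)), (20, 400, days_ago(60)),  # min date too old
        (30, 500, days_ago(20)), (30, 600, days_ago(5)),   # min date too new
    ],
)
bounds = (days_ago(70), days_ago(50))

# WHERE filters individual rows before grouping, so dept 20 sneaks in
# and the sums only cover the rows that fall inside the date range.
where_rows = conn.execute(
    "SELECT dept_id, SUM(sal), MIN(thedate) FROM emp "
    "WHERE thedate BETWEEN ? AND ? "
    "GROUP BY dept_id ORDER BY dept_id",
    bounds,
).fetchall()

# HAVING filters whole groups after aggregation, using the true minimum.
having_rows = conn.execute(
    "SELECT dept_id, SUM(sal), MIN(thedate) FROM emp "
    "GROUP BY dept_id "
    "HAVING MIN(thedate) BETWEEN ? AND ? ORDER BY dept_id",
    bounds,
).fetchall()

print("WHERE: ", where_rows)
print("HAVING:", having_rows)
```

On this sample data the WHERE version returns departments 10 and 20 with partial sums, while the HAVING version returns only department 10 with the sum over all of its rows.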

Upvotes: 2
