SQL How to Extracting different data from two columns

Question

I have a view EMPLOYEE with report dates that captures employee information on a daily basis over a number of years.

I have written this sql (SAP HANA) as I am a beginner but this is proving to be very very slow and resource intensive and not usable although it works.

I want to capture between two report dates - all active employees on their last date and all withdrawn employees with a end date greater than first report date and first date they were withdrawn.

select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
    SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate  
    FROM “EMPLOYEETABLE”
        WHERE REPORTDATE>= Date’2019-04-01’ AND REPORTDATE<=Date’2019-07-01’ AND STATUS='Active'
        GROUP BY EMPLOYEEID
) r
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID =r. EMPLOYEEID AND E.REPORTDATE=r.MaxDate
UNION ALL
select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
    SELECT EMPLOYEEID, Min(REPORTDATE) as MinDate  
    FROM "EMPLOYEETABLE"
        WHERE REPORTDATE>= date’2019-04-01’ AND REPORTDATE<=Date’2019-07-01’ AND STATUS='Withdrawn' AND ENDDATE>= Date’2019-04-01’
        GROUP BY EMPLOYEEID
) w
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID =w. EMPLOYEEID AND E.REPORTDATE=w.MinDate

Can anyone help to write this more efficiently, please?

Sample data set

Reportdate  EmployeeID  startdate   enddate     status
01/04/2019  Steve       12/02/2012  Null        Active
01/04/2019  Don         15/06/2016  Null        Active
01/04/2019  John        14/03/2015  01/04/2019  Withdrawn
01/04/2019  Anna        12/05/2017  Null        Active
02/04/2019  Steve       12/02/2012  Null        Active
02/04/2019  Don         15/06/2016  Null        Active
02/04/2019  John        14/03/2015  01/04/2019  Withdrawn
02/04/2019  Anna        12/05/2017  Null        Active
03/04/2019  Steve       12/02/2012  Null        Active
03/04/2019  Don         15/06/2016  Null        Active
03/04/2019  John        14/03/2015  01/04/2019  Withdrawn
03/04/2019  Anna        12/05/2017  03/04/2019  Withdrawn

Desired Output

Reportdate  EmployeeID  startdate   enddate     status
03/04/2019  Steve       12/02/2012  Null        Active
03/04/2019  Don         15/06/2016  Null        Active
01/04/2019  Jon         14/03/2015  01/04/2019  Withdrawn
03/04/2019  Anna        12/05/2017  03/04/2019  Withdrawn

Obviously my dataset is very big as same employee appears every day and new employees added.

Lars Br. · Accepted Answer

Splitting up the query into units that each cover a specific aspect of the report/query-logic as indicated by davidc2p is a very good idea.

But there is no need for temporary tables to do that; common table expressions (CTE) aka "WITH CLAUSE" is sufficient for that.

Structure of the query

The really important insight into this is to see that the EMPLOYEETABLE is a snapshot table with capturing the STATUS of all employees every single date.

For the query, there is the general condition to consider only snapshots within a certain time frame.

Based on this "timeboxed" dataset, the query now deals with employees (as opposed to snapshots!) and their most recent status (within in the "timeboxed dataset).

This observation allows to easily determine the MAX()-STATUS for each employee. Since there is a special condition for "WIDTHDRAWN" employees that their respective ENDDATE needs to be on or after the start of the reporting time frame, the two distinct employee groups/sets/cohorts need their own subquery.

These two subqueries return for each employee which record (EMPLOYEEID+REPORTDATE serves as a unique key to the report-records) should be reported back as the result of the query.

To produce the output, the two employee groups get merged (UNION ALL) and are then used as the final filter/selector for all records to be returned from the base table.

the rewritten query

with report_base as (
-- all records relevant to the reporting timeframe)
    select
        reportdate, EmployeeID
      , startdate, enddate, status
    from
        employeetable
    where 
          reportdate >= date'2019-04-01' 
      and reportdate <= date'2019-07-01')

, active_employees as (
-- all employees with most recent status in reporting timeframe = ACTIVE 
-- should be DISJUNCT from WITHDRAWN_EMPLOYEES
    select
        employeeid
      , max(reportdate) as reportdate
    from 
        report_base
    group by 
        employeeid
    having
        max(status)='Active')

, withdrawn_employees as (
-- all employees with most recent status in reporting timefrawm = WITHDRAWN
-- the ENDDATE should be on or after the start of the reporting timeframe
-- should be DISJUNCT from ACTIVE_EMPLOYEES)
     select
        employeeid
      , min(reportdate) as reportdate
    from 
        report_base
    where 
        enddate >= date'2019-04-01'
    group by 
        employeeid
    having
        max(status)='Withdrawn')

, report_records as(
-- all records that should be returned)
        select
            employeeid, reportdate
        from 
           active_employees
    union all
        select
            employeeid, reportdate
        from 
            withdrawn_employees)

select
      rb.reportdate, rb.EmployeeID
    , rb.startdate, rb.enddate, rb.status
from
    report_base rb
 inner join report_records rr
 on (rb.employeeid, rb.reportdate)
  = (rr.employeeid, rr.reportdate);

why is it better?

As there is no volume test data available, I couldn't check the actual runtime performance differences between the OP's query and the refactored version.

However, the refactored version leads to one less join in the EXPLAIN PLAN which likely translates into performance and memory usage improvements.

In addition to that, the refactored plan is much clearer on how the result data gets computed and allows for a step-wise development/debugging.

SQL How to Extracting different data from two columns

Answers (2)

Structure of the query

the rewritten query

why is it better?

Related Questions