Jazzy
Jazzy

Reputation: 5

SQL How to Extracting different data from two columns

I have a view EMPLOYEE with report dates that captures employee information on a daily basis over a number of years.

I have written this sql (SAP HANA) as I am a beginner but this is proving to be very very slow and resource intensive and not usable although it works.

I want to capture between two report dates - all active employees on their last date and all withdrawn employees with a end date greater than first report date and first date they were withdrawn.

select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
    SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate  
    FROM “EMPLOYEETABLE”
        WHERE REPORTDATE>= Date’2019-04-01’ AND REPORTDATE<=Date’2019-07-01’ AND STATUS='Active'
        GROUP BY EMPLOYEEID
) r
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID =r. EMPLOYEEID AND E.REPORTDATE=r.MaxDate
UNION ALL
select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
    SELECT EMPLOYEEID, Min(REPORTDATE) as MinDate  
    FROM "EMPLOYEETABLE"
        WHERE REPORTDATE>= date’2019-04-01’ AND REPORTDATE<=Date’2019-07-01’ AND STATUS='Withdrawn' AND ENDDATE>= Date’2019-04-01’
        GROUP BY EMPLOYEEID
) w
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID =w. EMPLOYEEID AND E.REPORTDATE=w.MinDate

Can anyone help to write this more efficiently, please?

Sample data set

Reportdate  EmployeeID  startdate   enddate     status
01/04/2019  Steve       12/02/2012  Null        Active
01/04/2019  Don         15/06/2016  Null        Active
01/04/2019  John        14/03/2015  01/04/2019  Withdrawn
01/04/2019  Anna        12/05/2017  Null        Active
02/04/2019  Steve       12/02/2012  Null        Active
02/04/2019  Don         15/06/2016  Null        Active
02/04/2019  John        14/03/2015  01/04/2019  Withdrawn
02/04/2019  Anna        12/05/2017  Null        Active
03/04/2019  Steve       12/02/2012  Null        Active
03/04/2019  Don         15/06/2016  Null        Active
03/04/2019  John        14/03/2015  01/04/2019  Withdrawn
03/04/2019  Anna        12/05/2017  03/04/2019  Withdrawn

Desired Output

Reportdate  EmployeeID  startdate   enddate     status
03/04/2019  Steve       12/02/2012  Null        Active
03/04/2019  Don         15/06/2016  Null        Active
01/04/2019  Jon         14/03/2015  01/04/2019  Withdrawn
03/04/2019  Anna        12/05/2017  03/04/2019  Withdrawn

Obviously my dataset is very big as same employee appears every day and new employees added.

Upvotes: 0

Views: 871

Answers (2)

Lars Br.
Lars Br.

Reputation: 10388

Splitting up the query into units that each cover a specific aspect of the report/query-logic as indicated by davidc2p is a very good idea.

But there is no need for temporary tables to do that; common table expressions (CTE) aka "WITH CLAUSE" is sufficient for that.

Structure of the query

The really important insight into this is to see that the EMPLOYEETABLE is a snapshot table with capturing the STATUS of all employees every single date.

For the query, there is the general condition to consider only snapshots within a certain time frame.

Based on this "timeboxed" dataset, the query now deals with employees (as opposed to snapshots!) and their most recent status (within in the "timeboxed dataset).

This observation allows to easily determine the MAX()-STATUS for each employee. Since there is a special condition for "WIDTHDRAWN" employees that their respective ENDDATE needs to be on or after the start of the reporting time frame, the two distinct employee groups/sets/cohorts need their own subquery.

These two subqueries return for each employee which record (EMPLOYEEID+REPORTDATE serves as a unique key to the report-records) should be reported back as the result of the query.

To produce the output, the two employee groups get merged (UNION ALL) and are then used as the final filter/selector for all records to be returned from the base table.

the rewritten query

with report_base as (
-- all records relevant to the reporting timeframe)
    select
        reportdate, EmployeeID
      , startdate, enddate, status
    from
        employeetable
    where 
          reportdate >= date'2019-04-01' 
      and reportdate <= date'2019-07-01')

, active_employees as (
-- all employees with most recent status in reporting timeframe = ACTIVE 
-- should be DISJUNCT from WITHDRAWN_EMPLOYEES
    select
        employeeid
      , max(reportdate) as reportdate
    from 
        report_base
    group by 
        employeeid
    having
        max(status)='Active')

, withdrawn_employees as (
-- all employees with most recent status in reporting timefrawm = WITHDRAWN
-- the ENDDATE should be on or after the start of the reporting timeframe
-- should be DISJUNCT from ACTIVE_EMPLOYEES)
     select
        employeeid
      , min(reportdate) as reportdate
    from 
        report_base
    where 
        enddate >= date'2019-04-01'
    group by 
        employeeid
    having
        max(status)='Withdrawn')

, report_records as(
-- all records that should be returned)
        select
            employeeid, reportdate
        from 
           active_employees
    union all
        select
            employeeid, reportdate
        from 
            withdrawn_employees)

select
      rb.reportdate, rb.EmployeeID
    , rb.startdate, rb.enddate, rb.status
from
    report_base rb
 inner join report_records rr
 on (rb.employeeid, rb.reportdate)
  = (rr.employeeid, rr.reportdate);

why is it better?

As there is no volume test data available, I couldn't check the actual runtime performance differences between the OP's query and the refactored version.

However, the refactored version leads to one less join in the EXPLAIN PLAN which likely translates into performance and memory usage improvements.

In addition to that, the refactored plan is much clearer on how the result data gets computed and allows for a step-wise development/debugging.

Upvotes: 1

davidc2p
davidc2p

Reputation: 320

I will assume you can create temporary tables and supply with a solution in SqlServer.

I will also assume the following is wrong in your query:

Last line:

E. EMPLOYEEID =r.EMPLOYEEID

is

E. EMPLOYEEID =w.EMPLOYEEID

Second subquery:

AND ENDDATE>= Date’2019-07-01’

is

AND ENDDATE<= Date’2019-07-01’

So to divide, you should store the results of your subquery into temporary storage like this:

IF OBJECT_ID('tempdb..#minreportdate') IS NOT NULL
    DROP TABLE #minreportdate

SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate  
INTO #minreportdate
FROM EMPLOYEETABLE
WHERE REPORTDATE BETWEEN '2019-04-01' AND '2019-07-01' 
AND STATUS = 'Active'
GROUP BY EMPLOYEEID

And

IF OBJECT_ID('tempdb..#maxreportdate') IS NOT NULL
    DROP TABLE #maxreportdate

SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate  
INTO #maxreportdate
FROM EMPLOYEETABLE
WHERE REPORTDATE BETWEEN '2019-04-01' AND '2019-07-01' 
AND STATUS = 'Withdrawn'
GROUP BY EMPLOYEEID

Then execute the main query accessing your two temporary tables:

SELECT E.EMPLOYEEID, E.STATUS, E.STARTDATE, E.ENDDATE, E.REPORTDATE
FROM #minreportdate r

    LEFT JOIN "EMPLOYEETABLE" E
    ON  E. EMPLOYEEID   = r.EMPLOYEEID 
    AND E.REPORTDATE    = r.MaxDate

UNION ALL

SELECT E.EMPLOYEEID, E.STATUS, E.STARTDATE, E.ENDDATE, E.REPORTDATE
FROM #maxreportdate w

    LEFT JOIN "EMPLOYEETABLE" E
    ON  E. EMPLOYEEID   = w.EMPLOYEEID 
    AND E.REPORTDATE    = w.MinDate

Upvotes: 0

Related Questions