Reputation: 5
I have a view EMPLOYEE with report dates that captures employee information on a daily basis over a number of years.
I have written this sql (SAP HANA) as I am a beginner but this is proving to be very very slow and resource intensive and not usable although it works.
I want to capture between two report dates - all active employees on their last date and all withdrawn employees with a end date greater than first report date and first date they were withdrawn.
select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate
FROM “EMPLOYEETABLE”
WHERE REPORTDATE>= Date’2019-04-01’ AND REPORTDATE<=Date’2019-07-01’ AND STATUS='Active'
GROUP BY EMPLOYEEID
) r
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID =r. EMPLOYEEID AND E.REPORTDATE=r.MaxDate
UNION ALL
select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
SELECT EMPLOYEEID, Min(REPORTDATE) as MinDate
FROM "EMPLOYEETABLE"
WHERE REPORTDATE>= date’2019-04-01’ AND REPORTDATE<=Date’2019-07-01’ AND STATUS='Withdrawn' AND ENDDATE>= Date’2019-04-01’
GROUP BY EMPLOYEEID
) w
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID =w. EMPLOYEEID AND E.REPORTDATE=w.MinDate
Can anyone help to write this more efficiently, please?
Sample data set
Reportdate EmployeeID startdate enddate status
01/04/2019 Steve 12/02/2012 Null Active
01/04/2019 Don 15/06/2016 Null Active
01/04/2019 John 14/03/2015 01/04/2019 Withdrawn
01/04/2019 Anna 12/05/2017 Null Active
02/04/2019 Steve 12/02/2012 Null Active
02/04/2019 Don 15/06/2016 Null Active
02/04/2019 John 14/03/2015 01/04/2019 Withdrawn
02/04/2019 Anna 12/05/2017 Null Active
03/04/2019 Steve 12/02/2012 Null Active
03/04/2019 Don 15/06/2016 Null Active
03/04/2019 John 14/03/2015 01/04/2019 Withdrawn
03/04/2019 Anna 12/05/2017 03/04/2019 Withdrawn
Desired Output
Reportdate EmployeeID startdate enddate status
03/04/2019 Steve 12/02/2012 Null Active
03/04/2019 Don 15/06/2016 Null Active
01/04/2019 Jon 14/03/2015 01/04/2019 Withdrawn
03/04/2019 Anna 12/05/2017 03/04/2019 Withdrawn
Obviously my dataset is very big as same employee appears every day and new employees added.
Upvotes: 0
Views: 871
Reputation: 10388
Splitting up the query into units that each cover a specific aspect of the report/query-logic as indicated by davidc2p is a very good idea.
But there is no need for temporary tables to do that; common table expressions (CTE) aka "WITH CLAUSE" is sufficient for that.
The really important insight into this is to see that the EMPLOYEETABLE
is a snapshot table with capturing the STATUS
of all employees every single date.
For the query, there is the general condition to consider only snapshots within a certain time frame.
Based on this "timeboxed" dataset, the query now deals with employees (as opposed to snapshots!) and their most recent status (within in the "timeboxed dataset).
This observation allows to easily determine the MAX()
-STATUS
for each employee. Since there is a special condition for "WIDTHDRAWN" employees that their respective ENDDATE
needs to be on or after the start of the reporting time frame, the two distinct employee groups/sets/cohorts need their own subquery.
These two subqueries return for each employee which record (EMPLOYEEID
+REPORTDATE
serves as a unique key to the report-records) should be reported back as the result of the query.
To produce the output, the two employee groups get merged (UNION ALL
) and are then used as the final filter/selector for all records to be returned from the base table.
with report_base as (
-- all records relevant to the reporting timeframe)
select
reportdate, EmployeeID
, startdate, enddate, status
from
employeetable
where
reportdate >= date'2019-04-01'
and reportdate <= date'2019-07-01')
, active_employees as (
-- all employees with most recent status in reporting timeframe = ACTIVE
-- should be DISJUNCT from WITHDRAWN_EMPLOYEES
select
employeeid
, max(reportdate) as reportdate
from
report_base
group by
employeeid
having
max(status)='Active')
, withdrawn_employees as (
-- all employees with most recent status in reporting timefrawm = WITHDRAWN
-- the ENDDATE should be on or after the start of the reporting timeframe
-- should be DISJUNCT from ACTIVE_EMPLOYEES)
select
employeeid
, min(reportdate) as reportdate
from
report_base
where
enddate >= date'2019-04-01'
group by
employeeid
having
max(status)='Withdrawn')
, report_records as(
-- all records that should be returned)
select
employeeid, reportdate
from
active_employees
union all
select
employeeid, reportdate
from
withdrawn_employees)
select
rb.reportdate, rb.EmployeeID
, rb.startdate, rb.enddate, rb.status
from
report_base rb
inner join report_records rr
on (rb.employeeid, rb.reportdate)
= (rr.employeeid, rr.reportdate);
As there is no volume test data available, I couldn't check the actual runtime performance differences between the OP's query and the refactored version.
However, the refactored version leads to one less join in the EXPLAIN PLAN which likely translates into performance and memory usage improvements.
In addition to that, the refactored plan is much clearer on how the result data gets computed and allows for a step-wise development/debugging.
Upvotes: 1
Reputation: 320
I will assume you can create temporary tables and supply with a solution in SqlServer.
I will also assume the following is wrong in your query:
Last line:
E. EMPLOYEEID =r.EMPLOYEEID
is
E. EMPLOYEEID =w.EMPLOYEEID
Second subquery:
AND ENDDATE>= Date’2019-07-01’
is
AND ENDDATE<= Date’2019-07-01’
So to divide, you should store the results of your subquery into temporary storage like this:
IF OBJECT_ID('tempdb..#minreportdate') IS NOT NULL
DROP TABLE #minreportdate
SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate
INTO #minreportdate
FROM EMPLOYEETABLE
WHERE REPORTDATE BETWEEN '2019-04-01' AND '2019-07-01'
AND STATUS = 'Active'
GROUP BY EMPLOYEEID
And
IF OBJECT_ID('tempdb..#maxreportdate') IS NOT NULL
DROP TABLE #maxreportdate
SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate
INTO #maxreportdate
FROM EMPLOYEETABLE
WHERE REPORTDATE BETWEEN '2019-04-01' AND '2019-07-01'
AND STATUS = 'Withdrawn'
GROUP BY EMPLOYEEID
Then execute the main query accessing your two temporary tables:
SELECT E.EMPLOYEEID, E.STATUS, E.STARTDATE, E.ENDDATE, E.REPORTDATE
FROM #minreportdate r
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID = r.EMPLOYEEID
AND E.REPORTDATE = r.MaxDate
UNION ALL
SELECT E.EMPLOYEEID, E.STATUS, E.STARTDATE, E.ENDDATE, E.REPORTDATE
FROM #maxreportdate w
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID = w.EMPLOYEEID
AND E.REPORTDATE = w.MinDate
Upvotes: 0