Fares Dellel
Fares Dellel

Reputation: 23

View to get the minimum date with a complicated condition

I have a table in SQL Server like this:

+----------+-----------+------------+
| DateFrom | Completed | EmployeeID |
+----------+-----------+------------+
DateFrom: date not null -- unique for each EmployeeID
Completed: bit not null
EmployeeID: bigint not null

I want to create a view that will return the start date of the last period for every EmployeeID like this:

  1. If there is no Completed is true, get the minimum DateFrom. [The employee has one period which is still not completed]
+----------+-----------+------------+
| DateFrom | Completed | EmployeeID |
+----------+-----------+------------+
|2021-01-01|   false   |     1      |
|2021-01-05|   false   |     1      |
|2021-01-09|   false   |     1      |
|2021-01-10|   false   |     1      |
|2021-01-07|   false   |     2      |
|2021-01-15|   false   |     2      |
+----------+-----------+------------+

Expected Result:
2021-01-01 for EmployeeID = 1
2021-01-07 for EmployeeID = 2
  1. Else, return the minimum DateFrom after the last Completed is true. [The last period is still not completed]
+----------+-----------+------------+
| DateFrom | Completed | EmployeeID |
+----------+-----------+------------+
|2021-01-01|   false   |     1      |
|2021-01-05|   true    |     1      |
|2021-01-09|   false   |     1      |
|2021-01-10|   false   |     1      |
|2021-01-07|   true    |     2      |
|2021-01-15|   false   |     2      |
+----------+-----------+------------+

Expected Result:
2021-01-09 for EmployeeID = 1
2021-01-15 for EmployeeID = 2
  1. If the maximum DateFrom has Completed=true, return the minimum DateFrom before the last Completed is true and after the true before it, if exists. [The last period is completed with multiple subperiods]
+----------+-----------+------------+
| DateFrom | Completed | EmployeeID |
+----------+-----------+------------+
|2021-01-01|   false   |     1      |
|2021-01-05|   true    |     1      |
|2021-01-09|   false   |     1      |
|2021-01-10|   true    |     1      |
|2021-01-07|   false   |     2      |
|2021-01-15|   true    |     2      |
+----------+-----------+------------+

Expected Result:
2021-01-09 for EmployeeID = 1
2021-01-07 for EmployeeID = 2
  1. If the maximum DateFrom has Completed=true and there is no other rows or the row before it has Completed=true, return the maximum DateFrom. [The last period is completed with one subperiod]
+----------+-----------+------------+
| DateFrom | Completed | EmployeeID |
+----------+-----------+------------+
|2021-01-01|   false   |     1      |
|2021-01-05|   false   |     1      |
|2021-01-09|   true    |     1      |
|2021-01-10|   true    |     1      |
|2021-01-07|   true    |     2      |
+----------+-----------+------------+

Expected Result:
2021-01-10 for EmployeeID = 1
2021-01-07 for EmployeeID = 2

I am looking for the most optimized solution.

I tried this, but I get a NULL value in the third example:

WITH T AS (
    SELECT EmployeeID
        , MAX(CASE WHEN Completed = 0 THEN NULL ELSE DateFrom END) MaxDateFrom 
    FROM TableDates
    GROUP BY EmployeeID
)
SELECT TableDates.EmployeeID, MIN(TableDates.DateFrom) DateFrom
FROM T
LEFT JOIN TableDates ON T.EmployeeID = TableDates.EmployeeID
    AND (T.MaxDateFrom IS NULL OR TableDates.DateFrom > T.MaxDateFrom)
GROUP BY TableDates.EmployeeID

Upvotes: 2

Views: 285

Answers (2)

Dale K
Dale K

Reputation: 27226

Here is a working query. Its probably over complicated, but I leave the simplification to you.

It handles the 3 cases, all partitioned by EmployeeId as requested, as follows:

  1. When no Completed=1 exists, detected using the sum(Completed) over(), then first_value(DateFrom) is used.

  2. When the last row value is completed=1 and the preceding row is completed=0, detected using last_value(Completed) and lag(Completed) then max(case when Completed = 0 then DateFrom else null end) is used.

  3. The tricky case, when a Completed=1 exists and its not last. In this case the DateFrom of the most recent row where Completed=1 is found and then the min(DateFrom) is found for all rows more recent than the row previously detected, until the preceding Completed=1.

  4. If the last row has completed=1 and the second to last row has completed=1 then use the DateFrom from the last row. Coalesce ensures this if all other options are null.

insert into @Test (EmployeeId, DateFrom, Completed)
values
-- Scenario 1
(1, '2021-01-01', 0),
(1, '2021-01-02', 0),
(1, '2021-01-03', 0),
-- Scenario 2
(2, '2021-01-01', 0),
(2, '2021-01-02', 1),
(2, '2021-01-03', 0),
(2, '2021-01-04', 0),
-- Scenario 3
(3, '2021-01-01', 0),
(3, '2021-01-02', 1),
(3, '2021-01-03', 0),
(3, '2021-01-04', 1),
-- Special case, single row
(4, '2021-01-01', 1),
-- Scenario 4
(5, '2021-01-01', 0),
(5, '2021-01-02', 0),
(5, '2021-01-03', 1);

with cte as (
  select *
    -- First value of DateFrom over all rows (not the default)
    , first_value (DateFrom) over (partition by EmployeeId order by DateFrom asc rows between unbounded preceding and unbounded following) FirstDateFrom
    -- Last value of Completed over all rows (not the default)
    , last_value (Completed) over (partition by EmployeeId order by DateFrom asc rows between unbounded preceding and unbounded following) LastCompleted
    -- Find the Date of the last row with Completed = 1
    , max (case when Completed = 1 then DateFrom else null end) over (partition by EmployeeId order by DateFrom asc rows between unbounded preceding and unbounded following) LastCompletedNew
    -- Regular row number
    , row_number() over (partition by EmployeeId order by DateFrom desc) RowNumber
    -- Total number of rows with Completed = 1
    , sum(convert(int,Completed)) over (partition by EmployeeId) SumOfCompleted
    -- Max value of DateFrom where Completed = 0
    , max(case when Completed = 0 then DateFrom else null end) over (partition by EmployeeId order by DateFrom asc rows between unbounded preceding and unbounded following) MaxDateFrom
    -- Check the lagged complete to see if the last 2 rows are completed = 1
    , lag(Completed) over (partition by EmployeeId order by DateFrom asc) LaggedComplete
    -- Borrowed from Gordon to check which rows are prior to the last Completed = 1 and before the preceding Completed = 1
    , sum(case when completed = 1 then 1 else 0 end) over (partition by employeeid order by datefrom desc) as completed_seqnum
  from @Test
)
select
  EmployeeId
  -- Use the only DateFrom if there is only one
  , coalesce(case
    -- Scenario 1
    when SumOfCompleted = 0 then FirstDateFrom
    when LastCompleted = 1 then
      case
      -- Scenario 4
      when coalesce(LaggedComplete,0) = 1 then DateFrom
      -- Scenario 3
      else Scenario3
      end
    -- Scenario 2
    else ActualResult
    end, DateFrom) FinalResult
  --, * -- Uncomment for working
from (
  select *
    -- Find the lowest DateFrom which is greater then the DateFrom of the last row where Completed = 1
    , min(case when DateFrom > LastCompletedNew then DateFrom else null end) over (partition by EmployeeId) ActualResult
    -- Find the min DateFrom over the rows between the last Completed=1 and the Completed=1 before it (if it exists)
    , min(case when completed_seqnum = 1 then DateFrom else null end) over (partition by EmployeeId order by DateFrom asc rows between unbounded preceding and unbounded following) Scenario3
  from cte
) x
-- Because we have calculated the same result for every row we just take the first
where RowNumber = 1
order by x.EmployeeId asc, x.DateFrom asc;

Note: This assumes that there is only a single row per date.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269773

I think you just want conditional aggregation -- with a bunch of logic. Assuming that you have rows for every day, I think this does what you want:

select employeeid,
       (case when -- case 4
                  min(completed) = max(completed) and
                  min(completed) = 'true'
             then max(datefrom) 
             when -- case 1
                  min(completed) = max(completed) and
                  min(completed) = 'false'
             then min(datefrom) 
             when -- case 3
                  max(datefrom) = max(case when completed = 'true' then datefrom end)
             then min(case when completed_seqnum = 1 then datefrom end)
             else dateadd(day, 1, max(case when completed = 'true' then datefrom end))
        end)
from (select t.*,
             sum(case when completed = 'true' then 1 else 0 end) over (partition by employeeid order by datefrom desc) as completed_seqnum
      from t
     ) t
group by employeeid;

The need for a row every day is really just a convenience -- allowing the code to add one day to get the date after a particular "true" false, for instance. This could also be accomplished using lead() in the subquery.

Note: This does not handle all conditions (at least with a non-NULL date. For instance, it returns NULL when there is a sequence of "true"s at the end of the data.

If this is an issue -- this version of your question has been asked. Ask a new question with appropriate sample data and desired results. I also think that you might be able to explain the problem you are trying to solve and simplify the explanation.

EDIT:

If dates are missing, you can use:

select employeeid,
       (case when -- case 4
                  min(completed) = max(completed) and
                  min(completed) = 'true'
             then max(datefrom) 
             when -- case 1
                  min(completed) = max(completed) and
                  min(completed) = 'false'
             then min(datefrom) 
             when -- case 3
                  max(datefrom) = max(case when completed = 'true' then datefrom end)
             then min(case when completed_seqnum = 1 then datefrom end)
             else max(case when completed = 'true' then next_datefrom end)
        end)
from (select t.*,
             lead(datefrom) over (partition by employeeid order by datefrom) as next_datefrom,
             sum(case when completed = 'true' then 1 else 0 end) over (partition by employeeid order by datefrom desc) as completed_seqnum
      from t
     ) t
group by employeeid;

Upvotes: 1

Related Questions