Maurice
Maurice

Reputation: 149

Detect changes for each ID

Suppose I have the following data

ID   | year_month | Department
1233 | 2020-01-01 | A
1123 | 2020-02-01 | A
1123 | 2020-03-01 | NULL
1123 | 2020-04-01 | B
1123 | 2020-05-01 | B
1123 | 2020-06-01 | B
1123 | 2020-07-01 | NULL
9999 | 2020-01-01 | A
9999 | 2020-02-01 | A
9999 | 2020-03-01 | B
9999 | 2020-04-01 | B
9999 | 2020-05-01 | B
9999 | 2020-06-01 | A
9999 | 2020-07-01 | B

I want to identify the changes in department. , including going to NA/NULL. The desired output is:

ID   | Change_year_month | Old_Department | New_Department
1123 | 2020-03-01        | A              | NULL
1123 | 2020-04-01        | NULL           | B
1123 | 2020-07-01        | B              | NULL
9999 | 2020-03-01        | A              | B
9999 | 2020-06-01        | B              | A

Ideas I've already tried to pursue:

with x as(
SELECT T1.ID, T1.Department, MIN(T1.year_month) AS Change_year_month FROM dbo.Source
GROUP BY T1.ID, Department),
y as (
SELECT ID, year_month, 
rown = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY year_month) FROM x
)
select y.ID, T2.Department, year_month AS Change_year_month FROM y
right join (SELECT T1.ID,
MAX(Department) as Old_Department,
Min(Department) AS New_Department
FROM dbo.Source
GROUP BY T1.ID HAVING COUNT(DISTINCT(Department)) >= 2) T2 on y.ID = T2.ID
where rown = 1 

However, this does not yield the desired result. Whenever a NULL is involved, the query does not see the change. Whenever I change the NULL to something else (like: 'outside the scope'), then the ordering is wrong as the Old_department is never 'outside the scope', but the New_department always is. Also, I feel like the code is inefficient and not durable.

Does anyone have suggestions how to proceed or to construct of durable query?

Upvotes: 0

Views: 36

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270583

Here is a pretty simple method using lag():

select s.id, s.year_month, s.prev_department, s.department
from (select s.*,
             lag(year_month) over (partition id order by year_month) as prev_ym,
             lag(year_month) over (partition id, department order by year_month) as prev_ym_dept,
             lag(department) over (partition by id order by year_month) as prev_department
      from dbo.source s
     ) s
where prev_ym_dept <> prev_ym;

This looks at the dates for the comparison, so it just handles NULL values.

Of course, you can use more complicated comparisons:

select s.id, s.year_month, s.prev_department, s.department
from (select s.*,
             lag(year_month) over (partition id order by year_month) as prev_ym,
             min(year_month) over (partition by id) as min_year_month
      from dbo.source s
     ) s
where prev_department <> department or
      (department is null and
       prev_department is not null
      ) or
      (prev_department is null and
       department is not null and
       year_month <> min_year_month
      )

But that is rather tricky to express. And that might even have a mistake in filtering out the first row.

Upvotes: 1

Related Questions