rst
rst

Reputation: 2714

Get first value outside where window with lag function

When using the lag function on time series in SQL Server, I always struggle with the first value in a time series. Assume this trivial example

CREATE TABLE demo
    ([id] int, [time] date, [content] int)
;
INSERT INTO demo (id, time, content) VALUES
  (1, '2021-05-31', cast(rand()*1000 as int)),
  (2, '2021-06-01', cast(rand()*1000 as int)),
  (3, '2021-06-02',cast(rand()*1000 as int)),
  (4, '2021-06-03', cast(rand()*1000 as int)),
(5, '2021-06-04', cast(rand()*1000 as int)),
(6, '2021-06-05', cast(rand()*1000 as int)),
(7, '2021-06-06', cast(rand()*1000 as int)),
(8, '2021-06-07', cast(rand()*1000 as int)),
(9, '2021-06-08', cast(rand()*1000 as int));

I want to get all values and their previous value in June, so something like this

select content, lag(content, 1, null) over (order by time)
from demo
where time >= '2021-06-01'

so far so good, however, the first entry will result in null for the previous value.

Of course there are many solutions on how to fill the null value, e.g. subselecting a larger range etc. but for very large tables I somehow think there should be an elegant solution to this.

Sometimes I do stuff like this

select content, lag(content, 1, 
(select content from demo d1 join 
(select max(time) maxtime from demo where time < '2021-06-01') d2 on d1.time = d2.maxtime
)) over (order by time)
from demo
where time >= '2021-06-01'

Is there something more efficient? (note: of course for this trivial example I doesn't make a difference, but for tables with partition and 500'000'000 entries, one should find the most efficient solution)

Check out the fiddle

Upvotes: 0

Views: 1710

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269873

The key idea is to use a subquery:

select t.*
from (select content, lag(content) over (order by time)
      from demo d
     ) d
where time >= '2021-06-01';

This is probably going to scan the entire table. However, you can create an index demo(time, content) to help the lag().

Next, you can optimize this if you have a reasonable lookback period. For instance, if there are records every month, just go back one month in the subquery:

select t.*
from (select content, lag(content) over (order by time)
      from demo d
      where time >= '2021-05-01'
     ) d
where time >= '2021-06-01';

This can also be very important if your data is partitioned -- as large tables are wont to be.

Upvotes: 1

banana_99
banana_99

Reputation: 641

What would you like the null values to be? I've put them as 0 in the below example.

SELECT
    content,
    coalesce(LAG(content, 1, NULL) OVER(
        ORDER BY
            time
    ), content-1) lag_content
FROM
    demo
WHERE
    time >= '2021-06-01'

Output:

content lag_content
-------------------
      2           1
      3           2
      4           3
      5           4
      6           5
      7           6
      8           7
      9           8

Try it out here: dbfiddle

Upvotes: 0

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521289

For this particular case, going by your comments, you may first compute the lag over the entire unfiltered table, then subquery that based on date:

WITH cte AS (
    SELECT time, content, LAG(content) OVER (ORDER BY time) lag_content
    FROM demo
)

SELECT content, lag_content
FROM cte
WHERE time >= '2021-06-01';

Upvotes: 0

Related Questions