Reputation: 2714
When using the lag function on time series in SQL Server, I always struggle with the first value in a time series. Assume this trivial example
CREATE TABLE demo
([id] int, [time] date, [content] int)
;
INSERT INTO demo (id, time, content) VALUES
(1, '2021-05-31', cast(rand()*1000 as int)),
(2, '2021-06-01', cast(rand()*1000 as int)),
(3, '2021-06-02',cast(rand()*1000 as int)),
(4, '2021-06-03', cast(rand()*1000 as int)),
(5, '2021-06-04', cast(rand()*1000 as int)),
(6, '2021-06-05', cast(rand()*1000 as int)),
(7, '2021-06-06', cast(rand()*1000 as int)),
(8, '2021-06-07', cast(rand()*1000 as int)),
(9, '2021-06-08', cast(rand()*1000 as int));
I want to get all values and their previous value in June, so something like this
select content, lag(content, 1, null) over (order by time)
from demo
where time >= '2021-06-01'
so far so good, however, the first entry will result in null for the previous value.
Of course there are many solutions on how to fill the null value, e.g. subselecting a larger range etc. but for very large tables I somehow think there should be an elegant solution to this.
Sometimes I do stuff like this
select content, lag(content, 1,
(select content from demo d1 join
(select max(time) maxtime from demo where time < '2021-06-01') d2 on d1.time = d2.maxtime
)) over (order by time)
from demo
where time >= '2021-06-01'
Is there something more efficient? (note: of course for this trivial example I doesn't make a difference, but for tables with partition and 500'000'000 entries, one should find the most efficient solution)
Check out the fiddle
Upvotes: 0
Views: 1710
Reputation: 1269873
The key idea is to use a subquery:
select t.*
from (select content, lag(content) over (order by time)
from demo d
) d
where time >= '2021-06-01';
This is probably going to scan the entire table. However, you can create an index demo(time, content)
to help the lag()
.
Next, you can optimize this if you have a reasonable lookback period. For instance, if there are records every month, just go back one month in the subquery:
select t.*
from (select content, lag(content) over (order by time)
from demo d
where time >= '2021-05-01'
) d
where time >= '2021-06-01';
This can also be very important if your data is partitioned -- as large tables are wont to be.
Upvotes: 1
Reputation: 641
What would you like the null values to be? I've put them as 0 in the below example.
SELECT
content,
coalesce(LAG(content, 1, NULL) OVER(
ORDER BY
time
), content-1) lag_content
FROM
demo
WHERE
time >= '2021-06-01'
Output:
content lag_content
-------------------
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
Try it out here: dbfiddle
Upvotes: 0
Reputation: 521289
For this particular case, going by your comments, you may first compute the lag over the entire unfiltered table, then subquery that based on date:
WITH cte AS (
SELECT time, content, LAG(content) OVER (ORDER BY time) lag_content
FROM demo
)
SELECT content, lag_content
FROM cte
WHERE time >= '2021-06-01';
Upvotes: 0