Reputation: 1651
I would like to perform a window function with a filter clause, for example:
LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC) AS "A_lag_1"
However, Postgres doesn't support this operation but I cant determine how else do it. Details below
Challange
Input tab_A
:
+----+------+------+
| id | type | date |
+----+------+------+
| 1 | A | 30 |
| 1 | A | 25 |
| 1 | A | 20 |
| 1 | B | 29 |
| 1 | B | 28 |
| 1 | B | 21 |
| 1 | C | 24 |
| 1 | C | 22 |
+----+------+------+
Desired output:
+----+------+------+---------+---------+---------+---------+---------+---------+
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
| 1 | A | 30 | 25 | 20 | 29 | 28 | 24 | 22 |
| 1 | A | 25 | 20 | | | | 24 | 22 |
| 1 | A | 20 | | | | | | |
| 1 | B | 29 | 25 | 20 | 28 | 21 | 24 | 22 |
| 1 | B | 28 | 25 | 20 | 21 | | 24 | 22 |
| 1 | B | 21 | 20 | | | | 24 | 22 |
| 1 | C | 24 | 20 | | 21 | | 22 | |
| 1 | C | 22 | 20 | | 21 | | | |
+----+------+------+---------+---------+---------+---------+---------+---------+
In words:
date
column)type
('A', 'B', 'C') put the most recent date
in A_lag_1
and the second the most recent (by date) value in A_lag_2
for type
'A', and B_lag_1
, B_lag_2
for 'B' etc.The above example is quite simplified, in my real use case there will be a lot more id
values, more lag column iterations A_lag_X
and types.
Posible solution
This challenge seems a perfect fit for a window function as I want to keep the same number of rows tab_A
and append information which is related to the row but in the past.
So constructing it using a window function (sqlfiddle):
SELECT
id, type, "date",
LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A
However, I get the following error:
ERROR: FILTER is not implemented for non-aggregate window functions Position: 30
Although this error is referenced in the documentation I cant determine another way of doing it.
Any help would be much appreciated.
Other SO questions:
max
. However, this won't work when trying to retrieve the 2nd last row, 3rd last row etc.Upvotes: 2
Views: 2495
Reputation: 1651
As the FILTER
clause does work with aggregate functions, I decided to write my own.
----- N = 1
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_1(agg_state point, el float)
returns point
immutable
language plpgsql
as $$
declare
i integer;
stored_value float;
begin
i := agg_state[0];
stored_value := agg_state[1];
i := i + 1; -- First row i=1
if i = 1 then
stored_value := el;
end if;
return point(i, stored_value);
end;
$$;
-- Final function
--DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
create or replace function lag_agg_ffunc_1(agg_state point)
returns float
immutable
strict
language plpgsql
as $$
begin
return agg_state[1];
end;
$$;
-- Aggregate function
drop aggregate if exists lag_agg_1(double precision);
create aggregate lag_agg_1 (float) (
sfunc = lag_agg_sfunc_1,
stype = point,
finalfunc = lag_agg_ffunc_1,
initcond = '(0,0)'
);
----- N = 2
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_2(agg_state point, el float)
returns point
immutable
language plpgsql
as $$
declare
i integer;
stored_value float;
begin
i := agg_state[0];
stored_value := agg_state[1];
i := i + 1; -- First row i=1
if i = 2 then
stored_value := el;
end if;
return point(i, stored_value);
end;
$$;
-- Final function
--DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
create or replace function lag_agg_ffunc_2(agg_state point)
returns float
immutable
strict
language plpgsql
as $$
begin
return agg_state[1];
end;
$$;
-- Aggregate function
drop aggregate if exists lag_agg_2(double precision);
create aggregate lag_agg_2 (float) (
sfunc = lag_agg_sfunc_2,
stype = point,
finalfunc = lag_agg_ffunc_2,
initcond = '(0,0)'
);
You can use the above aggregate functions lag_agg_1
and lag_agg_2
with the window expression in the original question:
SELECT
id, type, "date",
NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_1",
NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_2",
NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_1",
NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_2",
NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_1",
NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_2"
FROM tab_A
ORDER BY id ASC, type, "date" DESC
This executes reasonably quickly compared to the other options. Some things that could be improved:
Any help with the above would be much appreciated
Upvotes: 1
Reputation: 1651
Another possible solution using a lateral join (fiddle):
SELECT
a.id,
a.type,
a."date",
c.nn_row,
c.type,
c."date" as "date_joined"
FROM tab_A AS a
LEFT JOIN LATERAL (
SELECT
type,
"date",
row_number() OVER (PARTITION BY id, type ORDER BY id ASC, "date" DESC) as nn_row
FROM tab_A AS b
WHERE a."date" > b."date"
) AS c on true
WHERE c.nn_row <= 5
This creates a long table like:
+----+------+------+--------+------+-------------+
| id | type | date | nn_row | type | date_joined |
+----+------+------+--------+------+-------------+
| 1 | A | 30 | 1 | A | 25 |
| 1 | A | 30 | 2 | A | 20 |
| 1 | A | 30 | 1 | B | 29 |
| 1 | A | 30 | 2 | B | 28 |
| 1 | A | 30 | 3 | B | 21 |
| 1 | A | 30 | 1 | C | 24 |
| 1 | A | 30 | 2 | C | 22 |
| 1 | A | 25 | 1 | A | 20 |
| 1 | A | 25 | 1 | B | 21 |
| 1 | A | 25 | 1 | C | 24 |
| 1 | A | 25 | 2 | C | 22 |
| 1 | B | 29 | 1 | A | 25 |
| 1 | B | 29 | 2 | A | 20 |
| 1 | B | 29 | 1 | B | 28 |
| 1 | B | 29 | 2 | B | 21 |
| 1 | B | 29 | 1 | C | 24 |
| 1 | B | 29 | 2 | C | 22 |
| 1 | B | 28 | 1 | A | 25 |
| 1 | B | 28 | 2 | A | 20 |
| 1 | B | 28 | 1 | B | 21 |
| 1 | B | 28 | 1 | C | 24 |
| 1 | B | 28 | 2 | C | 22 |
| 1 | B | 21 | 1 | A | 20 |
| 1 | C | 24 | 1 | A | 20 |
| 1 | C | 24 | 1 | B | 21 |
| 1 | C | 24 | 1 | C | 22 |
| 1 | C | 22 | 1 | A | 20 |
| 1 | C | 22 | 1 | B | 21 |
+----+------+------+--------+------+-------------+
After which you can pivot to the desired output.
However, this worked for me on a small sample but on the full table Postgres ran out of disk space (even though I have 50GB available):
ERROR: could not write to hash-join temporary file: No space left on device
I have posted this solution here as it might work for others who have smaller tables
Upvotes: 2
Reputation: 503
You can try something like below.
SELECT
dt.* ,
(SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.type = 'A' AND dt.A_lag_1 > b.dateVAL ) AS "A_lag_2",
(SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.type = 'B' AND dt.B_lag_1 > b.dateVAL ) AS "B_lag_2" ,
(SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.type = 'C' AND dt.C_lag_1 > b.dateVAL ) AS "C_lag_2"
FROM
(
SELECT
a.id, a.type, a.dateVAL,
(SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.type = 'A' AND a.dateVAL > b.dateVAL ) as A_lag_1,
(SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.type = 'B' AND a.dateVAL > b.dateVAL ) as B_lag_1,
(SELECT MAX(b.dateVAL) FROM tab_A b WHERE b.type = 'C' AND a.dateVAL > b.dateVAL ) as C_lag_1
FROM tab_A a
) dt
Here is the Fiddle link. this may not be the most efficient way to do that.
Upvotes: 0