joshlk
joshlk

Reputation: 1651

Take first, second, third ... last value and selecting rows (Window function with filter and lag)

I would like to perform a window function with a filter clause, for example:

LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC) AS "A_lag_1"

However, Postgres doesn't support this operation but I cant determine how else do it. Details below

Challange

Input tab_A:

+----+------+------+
| id | type | date |
+----+------+------+
|  1 | A    |   30 |
|  1 | A    |   25 |
|  1 | A    |   20 |
|  1 | B    |   29 |
|  1 | B    |   28 |
|  1 | B    |   21 |
|  1 | C    |   24 |
|  1 | C    |   22 |
+----+------+------+

Desired output:

+----+------+------+---------+---------+---------+---------+---------+---------+
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
|  1 | A    |   30 |      25 |      20 |      29 |      28 |      24 |      22 |
|  1 | A    |   25 |      20 |         |         |         |      24 |      22 |
|  1 | A    |   20 |         |         |         |         |         |         |
|  1 | B    |   29 |      25 |      20 |      28 |      21 |      24 |      22 |
|  1 | B    |   28 |      25 |      20 |      21 |         |      24 |      22 |
|  1 | B    |   21 |      20 |         |         |         |      24 |      22 |
|  1 | C    |   24 |      20 |         |      21 |         |      22 |         |
|  1 | C    |   22 |      20 |         |      21 |         |         |         |
+----+------+------+---------+---------+---------+---------+---------+---------+

In words:

The above example is quite simplified, in my real use case there will be a lot more id values, more lag column iterations A_lag_X and types.

Posible solution This challenge seems a perfect fit for a window function as I want to keep the same number of rows tab_A and append information which is related to the row but in the past.

So constructing it using a window function (sqlfiddle):

SELECT
  id, type, "date",
  LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
  LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
  LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
  LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
  LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
  LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A

However, I get the following error:

ERROR: FILTER is not implemented for non-aggregate window functions Position: 30

Although this error is referenced in the documentation I cant determine another way of doing it.

Any help would be much appreciated.


Other SO questions:

Upvotes: 2

Views: 2495

Answers (3)

joshlk
joshlk

Reputation: 1651

As the FILTER clause does work with aggregate functions, I decided to write my own.

----- N = 1
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_1(agg_state point, el float)
    returns point
    immutable
    language plpgsql
    as $$
declare
    i integer;
    stored_value float;
begin
    i := agg_state[0];
    stored_value := agg_state[1];

    i := i + 1; -- First row i=1
    if i = 1 then
        stored_value := el;
    end if;
    return point(i, stored_value);
end;
$$;

-- Final function
--DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
create or replace function lag_agg_ffunc_1(agg_state point)
    returns float
    immutable
    strict
    language plpgsql
    as $$
begin
  return agg_state[1];
end;
$$;

-- Aggregate function
drop aggregate if exists lag_agg_1(double precision);
create aggregate lag_agg_1 (float) (
    sfunc = lag_agg_sfunc_1,
    stype = point,
    finalfunc = lag_agg_ffunc_1,
    initcond = '(0,0)'
);


----- N = 2
-- State transition function
-- agg_state: the current state, el: new element
create or replace function lag_agg_sfunc_2(agg_state point, el float)
    returns point
    immutable
    language plpgsql
    as $$
declare
    i integer;
    stored_value float;
begin
    i := agg_state[0];
    stored_value := agg_state[1];

    i := i + 1; -- First row i=1
    if i = 2 then
        stored_value := el;
    end if;
    return point(i, stored_value);
end;
$$;

-- Final function
--DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
create or replace function lag_agg_ffunc_2(agg_state point)
    returns float
    immutable
    strict
    language plpgsql
    as $$
begin
  return agg_state[1];
end;
$$;

-- Aggregate function
drop aggregate if exists lag_agg_2(double precision);
create aggregate lag_agg_2 (float) (
    sfunc = lag_agg_sfunc_2,
    stype = point,
    finalfunc = lag_agg_ffunc_2,
    initcond = '(0,0)'
);

You can use the above aggregate functions lag_agg_1 and lag_agg_2 with the window expression in the original question:

SELECT
  id, type, "date",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_2",
  NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_1",
  NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_2"
FROM tab_A
ORDER BY id ASC, type, "date" DESC

This executes reasonably quickly compared to the other options. Some things that could be improved:

  • I couldn't determine how to work with null values properly and so at the end fudging the result by converting all 0's to NULLs. This will cause problems in certain situations
  • I have just copied and pasted the functions for each lag_X as I couldn't determine how to parameterise this

Any help with the above would be much appreciated

Upvotes: 1

joshlk
joshlk

Reputation: 1651

Another possible solution using a lateral join (fiddle):

SELECT
    a.id,
    a.type,
    a."date",
    c.nn_row,
    c.type,
    c."date" as "date_joined"
FROM tab_A AS a
LEFT JOIN LATERAL (
    SELECT
        type,
        "date",
        row_number() OVER (PARTITION BY id, type ORDER BY id ASC, "date" DESC) as nn_row
    FROM tab_A AS b
    WHERE a."date" > b."date"
) AS c on true
WHERE c.nn_row <= 5

This creates a long table like:

+----+------+------+--------+------+-------------+
| id | type | date | nn_row | type | date_joined |
+----+------+------+--------+------+-------------+
|  1 | A    |   30 |      1 | A    |          25 |
|  1 | A    |   30 |      2 | A    |          20 |
|  1 | A    |   30 |      1 | B    |          29 |
|  1 | A    |   30 |      2 | B    |          28 |
|  1 | A    |   30 |      3 | B    |          21 |
|  1 | A    |   30 |      1 | C    |          24 |
|  1 | A    |   30 |      2 | C    |          22 |
|  1 | A    |   25 |      1 | A    |          20 |
|  1 | A    |   25 |      1 | B    |          21 |
|  1 | A    |   25 |      1 | C    |          24 |
|  1 | A    |   25 |      2 | C    |          22 |
|  1 | B    |   29 |      1 | A    |          25 |
|  1 | B    |   29 |      2 | A    |          20 |
|  1 | B    |   29 |      1 | B    |          28 |
|  1 | B    |   29 |      2 | B    |          21 |
|  1 | B    |   29 |      1 | C    |          24 |
|  1 | B    |   29 |      2 | C    |          22 |
|  1 | B    |   28 |      1 | A    |          25 |
|  1 | B    |   28 |      2 | A    |          20 |
|  1 | B    |   28 |      1 | B    |          21 |
|  1 | B    |   28 |      1 | C    |          24 |
|  1 | B    |   28 |      2 | C    |          22 |
|  1 | B    |   21 |      1 | A    |          20 |
|  1 | C    |   24 |      1 | A    |          20 |
|  1 | C    |   24 |      1 | B    |          21 |
|  1 | C    |   24 |      1 | C    |          22 |
|  1 | C    |   22 |      1 | A    |          20 |
|  1 | C    |   22 |      1 | B    |          21 |
+----+------+------+--------+------+-------------+

After which you can pivot to the desired output.

However, this worked for me on a small sample but on the full table Postgres ran out of disk space (even though I have 50GB available):

ERROR: could not write to hash-join temporary file: No space left on device

I have posted this solution here as it might work for others who have smaller tables

Upvotes: 2

Prabhath Amaradasa
Prabhath Amaradasa

Reputation: 503

You can try something like below.

SELECT
dt.* ,
(SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.type = 'A' AND dt.A_lag_1 >  b.dateVAL  ) AS "A_lag_2",
(SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.type = 'B' AND dt.B_lag_1 >  b.dateVAL  ) AS "B_lag_2" ,
(SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.type = 'C' AND dt.C_lag_1 >  b.dateVAL  ) AS "C_lag_2"
FROM
(
SELECT
  a.id, a.type, a.dateVAL,
 (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.type = 'A' AND a.dateVAL >  b.dateVAL  )  as A_lag_1,
 (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.type = 'B' AND a.dateVAL >  b.dateVAL  )  as B_lag_1,
 (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.type = 'C' AND a.dateVAL >  b.dateVAL  )  as C_lag_1
FROM tab_A a
)   dt

Here is the Fiddle link. this may not be the most efficient way to do that.

Upvotes: 0

Related Questions