PostgreSQL: Exclude duplicates as sorted by another key

Question

Consider the following table that stores update history of some attributes of certain objects, organized by effective and published dates:

create table update_history(
  obj_id integer,
  effective date,
  published date,
  attr1 text,
  attr2 integer,
  attr3 boolean,
  primary key(obj_id, effective, published)
);
insert into update_history values
  (1, '2021-01-01', '2021-01-01', 'foo', null, null),
  (1, '2021-01-01', '2021-01-02', null, 1, false),
  (1, '2021-01-02', '2021-01-01', 'foo', 1, false),
  (1, '2021-01-02', '2021-01-02', 'bar', 1, false),
  (1, '2021-01-03', '2021-01-01', 'bar', 1, true),
  (1, '2021-01-04', '2021-01-01', 'bar', 1, true),
  (1, '2021-01-05', '2021-01-01', 'bar', 2, true),
  (1, '2021-01-05', '2021-01-02', 'bar', 1, true),
  (1, '2021-01-05', '2021-01-03', 'bar', 1, true),
  (1, '2021-01-06', '2021-01-04', 'bar', 1, true)
;

I need to write a PostgreSQL query that will simplify the history view for a given obj_id by excluding those update records that did not change any attributes from the immediately preceding update as ordered by effective and published columns. In essence those would be rows ## 6, 9 and 10, marked in italic in the table below:

#	obj_id	effective	published	attr1	attr2	attr3
1	1	2021-01-01	2021-01-01	foo	(null)	(null)
2	1	2021-01-01	2021-01-02	(null)	1	false
3	1	2021-01-02	2021-01-01	foo	1	false
4	1	2021-01-02	2021-01-02	bar	1	false
5	1	2021-01-03	2021-01-01	bar	1	true
6	1	2021-01-04	2021-01-01	bar	1	true
7	1	2021-01-05	2021-01-01	bar	2	true
8	1	2021-01-05	2021-01-02	bar	1	true
9	1	2021-01-05	2021-01-03	bar	1	true
10	1	2021-01-06	2021-01-04	bar	1	true

Keep in mind that in the real life case there are way more attributes to deal with and I don't want the query to get too messy.

The closest I got to the desired result was using the rank window function:

select
  obj_id, effective, published,
  attr1, attr2, attr3
from (
  select *,
    rank() over (
      partition by attr1, attr2, attr3
      order by effective, published
    ) as rank
  from update_history
  where obj_id = 1) as d
where rank = 1
order by effective, published;

That results in this:

obj_id	effective	published	attr1	attr2	attr3
1	2021-01-01	2021-01-01	foo	(null)	(null)
1	2021-01-01	2021-01-02	(null)	1	false
1	2021-01-02	2021-01-01	foo	1	false
1	2021-01-02	2021-01-02	bar	1	false
1	2021-01-03	2021-01-01	bar	1	true
1	2021-01-05	2021-01-01	bar	2	true

As you can see, row #8 from the original table is erroneously excluded, although it changed attr2 from the its previous row, #7. Apparently, the problem is that partitioning is applied before sorting in the window definition.

I wonder if there is another way to accomplish this with a single PostgresSQL query.

user330315 · Accepted Answer

I would use the lag() for this:

select *
from (
  select obj_id, effective, published,
         attr1, attr2, attr3, 
         (attr1, attr2, attr3) is distinct from lag( (attr1,attr2,attr3) ) over (partition by obj_id order by effective, published) as is_different
  from update_history
) t
where is_different

PostgreSQL: Exclude duplicates as sorted by another key

Answers (1)

Related Questions