Vytas
Vytas

Reputation: 49

How to delete subsequent matching rows?

I have jsonb datatype, where each row has a name, last_updated, among other keys. How would I go about creating a query, which would leave only 1 row per name per day?

i.e. this:

id | data
1  | {"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}
2  | {"name": "foo1", "last_updated": "2019-10-06T01:29:30.000Z"}
3  | {"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}
4  | {"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}
5  | {"name": "foo2", "last_updated": "2019-10-06T01:29:30.000Z"}
6  | {"name": "foo2", "last_updated": "2019-10-06T02:29:30.000Z"}

becomes:

id | data
1  | {"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}
3  | {"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}
4  | {"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}

This query will run on some 9 million rows, on roughly 300 names.

Upvotes: 2

Views: 37

Answers (1)

zedfoxus
zedfoxus

Reputation: 37109

Try something like this:

Table

create table test (
  id serial,
  data jsonb
);

Data

insert into test (data) values
('{"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}'),
('{"name": "foo1", "last_updated": "2019-10-06T01:29:30.000Z"}'),
('{"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T01:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T02:29:30.000Z"}');

Query

with latest as (
  select data->>'name' as name, max(data->>'last_updated') as last_updated
  from test
  group by data->>'name'
)
delete from test t
where not exists (
  select 1 from latest
  where t.data->>'name' = name
    and t.data->>'last_updated' = last_updated
);

select * from test;

Example

https://dbfiddle.uk/?rdbms=postgres_10&fiddle=2415e6f2c9c7980e69d178a331120dcd

You might have to index your jsonb column like create index on test((data->>'name'));; you could do that for last_updated also.

I make the assumption that a user doesn't have identical last_updated.

If that assumption is not true, you could try this:

with ranking as (
  select 
    row_number() over (partition by data->>'name' order by data->>'last_updated' desc) as sr,
    x.*
  from test x
)
delete from test t
where not exists (
  select 1 from ranking
  where sr = 1
    and id = t.id
);

In this case, we first give a serial number to users' records. Each user's latest_updated time gets sr 1.

Then, we ask the database to delete all records that aren't a match for sr 1's id.

Example: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=dba1879a755ed0ec90580352f82554ee

Upvotes: 2

Related Questions