Reputation: 49
I have jsonb datatype, where each row has a name, last_updated, among other keys. How would I go about creating a query, which would leave only 1 row per name per day?
i.e. this:
id | data
1 | {"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}
2 | {"name": "foo1", "last_updated": "2019-10-06T01:29:30.000Z"}
3 | {"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}
4 | {"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}
5 | {"name": "foo2", "last_updated": "2019-10-06T01:29:30.000Z"}
6 | {"name": "foo2", "last_updated": "2019-10-06T02:29:30.000Z"}
becomes:
id | data
1 | {"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}
3 | {"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}
4 | {"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}
This query will run on some 9 million rows, on roughly 300 names.
Upvotes: 2
Views: 37
Reputation: 37109
Try something like this:
Table
create table test (
id serial,
data jsonb
);
Data
insert into test (data) values
('{"name": "foo1", "last_updated": "2019-10-06T09:29:30.000Z"}'),
('{"name": "foo1", "last_updated": "2019-10-06T01:29:30.000Z"}'),
('{"name": "foo1", "last_updated": "2019-10-07T01:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T09:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T01:29:30.000Z"}'),
('{"name": "foo2", "last_updated": "2019-10-06T02:29:30.000Z"}');
Query
with latest as (
select data->>'name' as name, max(data->>'last_updated') as last_updated
from test
group by data->>'name'
)
delete from test t
where not exists (
select 1 from latest
where t.data->>'name' = name
and t.data->>'last_updated' = last_updated
);
select * from test;
Example
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=2415e6f2c9c7980e69d178a331120dcd
You might have to index your jsonb column like create index on test((data->>'name'));
; you could do that for last_updated also.
I make the assumption that a user doesn't have identical last_updated.
If that assumption is not true, you could try this:
with ranking as (
select
row_number() over (partition by data->>'name' order by data->>'last_updated' desc) as sr,
x.*
from test x
)
delete from test t
where not exists (
select 1 from ranking
where sr = 1
and id = t.id
);
In this case, we first give a serial number to users' records. Each user's latest_updated time gets sr 1.
Then, we ask the database to delete all records that aren't a match for sr 1's id.
Example: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=dba1879a755ed0ec90580352f82554ee
Upvotes: 2