Reputation: 387
So, i have log table with something about 8M records. Because of programming error it happened that there are more than 1 record for company within same date. Now, what i need is to delete all records from this log for each company for same date except latest (which has max id). Count of records to be deleted approximately 300K.
The fastest and easiest thing that i tried is this
delete from indexing_log where id not in (
select max(id)
from indexing_log
group by company_id,
"date"
)
But this query is taking enormous time (about 3 days) on production server (which for some reason doesn't have ssd drive). I tried all ways that i know and need some advice. How can it be faster?
UPDATE I decided to do it in bucket way through celery task.
Upvotes: 2
Views: 137
Reputation: 2274
Not Exists
is sometimes faster than Not in
delete from indexing_log
where not exists (select 1
from (select max(id) as iid
from indexing_log
group by company_id,
"date") mids
where id = mids.iid
)
Upvotes: 1
Reputation: 125204
Dump the distinct rows to a temporary table
create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;
Truncate the original
truncate table indexing_log;
Since the table is now empty use the opportunity to do an instantaneous vacuum
:
vacuum full indexing_log;
Move the rows from the temporary to the original
insert into indexing_log
select *
from t;
Upvotes: 2
Reputation: 318
Truncate Table should be much quicker. But there you cannot say "delete everything except..." If it is possible with your data you could write a procedure for that, save your Max IDs into a temptable, trucate the table and write your temptable back. For PostgreSQL the syntax is slighly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html)
SELECT * from indexing_log
INTO #temptable
WHERE id IN (
SELECT max(id)
FROM indexing_log
GROUP BY company_id,
"date")
Upvotes: 1
Reputation: 117337
you can try
delete from indexing_log as l
where
exists
(
select *
from indexing_log as i
where i.id < l.id and i.company_id = l.company_id and i.dt = l.dt
);
Upvotes: 2