free2use
free2use

Reputation: 387

PostgreSQL/SQL query optimization

So, i have log table with something about 8M records. Because of programming error it happened that there are more than 1 record for company within same date. Now, what i need is to delete all records from this log for each company for same date except latest (which has max id). Count of records to be deleted approximately 300K.

The fastest and easiest thing that i tried is this

delete from indexing_log where id not in (
select max(id)
from indexing_log
group by company_id,
"date"
)

But this query is taking enormous time (about 3 days) on production server (which for some reason doesn't have ssd drive). I tried all ways that i know and need some advice. How can it be faster?

UPDATE I decided to do it in bucket way through celery task.

Upvotes: 2

Views: 137

Answers (4)

Romesh
Romesh

Reputation: 2274

Not Exists is sometimes faster than Not in

delete from indexing_log 
where not exists (select 1
                    from (select max(id) as iid
                            from indexing_log
                           group by company_id,
                                 "date") mids
                   where id = mids.iid
                 )

Upvotes: 1

Clodoaldo Neto
Clodoaldo Neto

Reputation: 125204

Dump the distinct rows to a temporary table

create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

Truncate the original

truncate table indexing_log;

Since the table is now empty use the opportunity to do an instantaneous vacuum:

vacuum full indexing_log;

Move the rows from the temporary to the original

insert into indexing_log
select *
from t;

Upvotes: 2

Bruellhusten
Bruellhusten

Reputation: 318

Truncate Table should be much quicker. But there you cannot say "delete everything except..." If it is possible with your data you could write a procedure for that, save your Max IDs into a temptable, trucate the table and write your temptable back. For PostgreSQL the syntax is slighly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html)

SELECT * from indexing_log 
INTO #temptable 
WHERE id IN (
    SELECT max(id)
    FROM indexing_log
    GROUP BY company_id,
    "date")

Upvotes: 1

roman
roman

Reputation: 117337

you can try

delete from indexing_log as l
where
    exists
    (
        select *
        from indexing_log as i
        where i.id < l.id and i.company_id = l.company_id and i.dt = l.dt
    );

Upvotes: 2

Related Questions