Snobby

Reputation: 1155

SELECT DISTINCT: most suitable technique to reduce wait time

I have a lot of rows in my table in a Postgres DB.

I insert into this table every 20 minutes, clear old entries every day, and have only 2 kinds of selects from it.

So I want to reduce the time I wait for my selects.

The first select is something like:

SELECT * FROM table WHERE item = <item_id>

The second is something like:

SELECT DISTINCT(datetime) FROM table

So, in order to optimize the first select, I can create an index on the item field. As I understand it, this technique works well for queries where something equals something.
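For example, a minimal sketch of what I mean (using my real table wow_auction_gordunni from the EXPLAIN output below and the item column from the first query; <item_id> is still a placeholder):

-- Index on the column used in the equality condition:
CREATE INDEX ON wow_auction_gordunni (item);

-- The first query should then be able to use an index scan:
EXPLAIN ANALYZE SELECT * FROM wow_auction_gordunni WHERE item = <item_id>;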

But I don't know how to optimize my second select. I think something like partitioning should help me, but there are several types of partitioning and I am a bit confused.
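From what I have read, range partitioning on the datetime column would look roughly like this (just a sketch, assuming PostgreSQL 10+ declarative partitioning; the table and column names here are placeholders, not my real schema):

-- Sketch only: one partition per day, partitioned by the datetime column.
CREATE TABLE auctions (
    item     bigint,
    datetime timestamptz
) PARTITION BY RANGE (datetime);

CREATE TABLE auctions_2017_05_01 PARTITION OF auctions
    FOR VALUES FROM ('2017-05-01') TO ('2017-05-02');

-- Clearing old entries then becomes dropping the old partition instead of a DELETE:
DROP TABLE auctions_2017_05_01;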

So what is the best option to optimize my queries?

Also, I am using Python and Django models. If there is a good library that can do all the dirty work, that would be great. The most suitable one I have found so far is http://architect.readthedocs.io/.

Edit 1: Thanks to Evan Carroll.

I tried using an index for the second query. The command:

explain analyze select distinct time_updated from wow_auction_gordunni

Gives:

HashAggregate  (cost=335091.65..335092.51 rows=86 width=8) (actual time=4246.582..4246.607 rows=91 loops=1)
  Group Key: time_updated
  ->  Seq Scan on wow_auction_gordunni  (cost=0.00..313574.92 rows=8606692 width=8) (actual time=0.047..2257.979 rows=8616562 loops=1)
Planning time: 0.080 ms
Execution time: 4246.675 ms

Then, creating an index and running VACUUM:

CREATE INDEX ON wow_auction_gordunni (time_updated);
VACUUM ANALYZE wow_auction_gordunni;
explain analyze select distinct time_updated from wow_auction_gordunni;

Gives the following:

Unique  (cost=0.43..249907.42 rows=92 width=8) (actual time=0.057..3537.626 rows=92 loops=1)
  ->  Index Only Scan using wow_auction_gordunni_time_updated_idx on wow_auction_gordunni  (cost=0.43..228163.42 rows=8697599 width=8) (actual time=0.055..2488.408 rows=8696562 loops=1)
        Heap Fetches: 85796
Planning time: 0.726 ms
Execution time: 3537.800 ms

So it seems that the index helps a bit (Postgres started to use the index), but not dramatically.

Upvotes: 1

Views: 261

Answers (2)

Evan Carroll

Reputation: 1

It will use an index scan so long as it makes sense. Sample data,

CREATE TABLE foo
AS
  SELECT
    x%3 AS x,
    repeat( md5(x::text)::text, 200 ) AS t1
  FROM generate_series(1,1e6) AS t(x);

CREATE INDEX ON foo (x);
VACUUM ANALYZE foo;

Query,

EXPLAIN ANALYZE SELECT DISTINCT x FROM Foo;
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.42..28480.42 rows=200 width=32) (actual time=0.034..257.734 rows=3 loops=1)
   ->  Index Only Scan using foo_x_idx on foo  (cost=0.42..25980.42 rows=1000000 width=32) (actual time=0.031..122.668 rows=1000000 loops=1)
         Heap Fetches: 0
 Planning time: 0.090 ms
 Execution time: 257.764 ms
(5 rows)

So to optimize your second select query, create an index on datetime. Check out the EXPLAIN ANALYZE output and see if it uses the index. If that doesn't help, or the index isn't being used, you can try setting enable_seqscan = off and rerunning the query. Then you know what the savings would be, if any. You can paste both plans here and we can take a look.
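For example, with the table from the question's edit (a sketch; enable_seqscan is only changed for the current session):

-- Discourage sequential scans for this session to see what an index plan would cost:
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT DISTINCT time_updated FROM wow_auction_gordunni;
RESET enable_seqscan;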

Upvotes: 1

Laurenz Albe

Reputation: 247235

There is no way to optimize the second query; it has to scan the whole table to find all possible values of datetime.

The best you can do is to make sure the table is not bloated, by clearing it with TRUNCATE rather than DELETE, and to put enough RAM into the machine so that the whole table fits in RAM.
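A sketch of the idea (TRUNCATE only applies if you clear the whole table; the DELETE plus VACUUM variant is an assumption for the case where only old rows are removed):

-- Removes all rows and reclaims the space immediately, leaving no dead tuples behind:
TRUNCATE wow_auction_gordunni;

-- If only old rows are deleted, run VACUUM afterwards to limit bloat:
DELETE FROM wow_auction_gordunni WHERE time_updated < now() - interval '1 day';
VACUUM wow_auction_gordunni;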

Upvotes: 0
