Reputation: 5605
I'm running the following two queries quite frequently on a table that essentially gathers up logging information. Both select distinct values over a huge number of rows, but each column has fewer than 10 distinct values.
I've run EXPLAIN on both "distinct" queries issued by the page:
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (cost=1070734.05..1070734.11 rows=6 width=21)
-> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023050.24 rows=19073524 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (cost=1070735.34..1070735.39 rows=5 width=13)
-> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023051.47 rows=19073547 width=13)
(2 rows)
Both do sequential scans of the table. However, if I turn off enable_seqscan (despite the name, this doesn't forbid sequential scans outright; it only discourages them when the planner has an alternative such as an index), the queries use the indexes but are even slower:
marchena=> set enable_seqscan = off;
SET
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..19613740.62 rows=6 width=21)
-> Index Scan using audit_bundle_idx on audit_records auditrecor0_ (cost=0.00..19566056.69 rows=19073570 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..45851449.96 rows=5 width=13)
-> Index Scan using audit_server_idx on audit_records auditrecor0_ (cost=0.00..45803766.04 rows=19073570 width=13)
(2 rows)
Both the bundle_id and server_name columns have btree indexes on them. Should I be using a different type of index to make selecting distinct values fast?
Upvotes: 10
Views: 7711
Reputation: 2147
On PostgreSQL 9.3, starting from Denis's answer:
select bundles.bundle_id
from bundles
where exists (
select 1 from audit_records
where audit_records.bundle_id = bundles.bundle_id
);
just by adding a limit 1 to the subquery I got a 60x speedup (for my use case: 8 million records, a composite index, and 10k combinations), going from 1800 ms to 30 ms:
select bundles.bundle_id
from bundles
where exists (
select 1 from audit_records
where audit_records.bundle_id = bundles.bundle_id limit 1
);
Upvotes: 1
Reputation: 4774
I have the same problem with tables of more than 300 million records and an indexed field with only a few distinct values. I couldn't get rid of the seq scan, so I made this function to simulate a distinct search using the index when one exists. If your table has a number of distinct values proportional to the total number of records, this function isn't a good fit, and it has to be adjusted for multi-column distinct values.
Warning: this function is wide open to SQL injection and should only be used in a secured environment.
EXPLAIN ANALYZE results:
Query with normal SELECT DISTINCT: Total runtime: 598310.705 ms
Query with SELECT small_distinct(...): Total runtime: 1.156 ms
CREATE OR REPLACE FUNCTION small_distinct(
    tableName varchar, fieldName varchar, sample anyelement = ''::varchar)
  -- Search a few distinct values in a possibly huge table
  -- Parameters: tableName or query expression, fieldName,
  --             sample: any value to specify result type (default is varchar)
  -- Author: T.Husson, 2012-09-17, distribute/use freely
  RETURNS TABLE ( result anyelement ) AS
$BODY$
BEGIN
  -- seed with the smallest value, found via the index if one exists
  EXECUTE 'SELECT '||fieldName||' FROM '||tableName||' ORDER BY '||fieldName
    ||' LIMIT 1' INTO result;
  WHILE result IS NOT NULL LOOP
    RETURN NEXT;
    -- then repeatedly fetch the next larger value with another index probe
    EXECUTE 'SELECT '||fieldName||' FROM '||tableName
      ||' WHERE '||fieldName||' > $1 ORDER BY ' || fieldName || ' LIMIT 1'
      INTO result USING result;
  END LOOP;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
Call samples:
SELECT small_distinct('observations','id_source',1);
SELECT small_distinct('(select * from obs where id_obs > 12345) as temp',
'date_valid','2000-01-01'::timestamp);
SELECT small_distinct('addresses','state');
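Regarding the injection warning above: if that matters in your environment, the dynamic SQL could instead be built with format() and %I identifier quoting (available since PostgreSQL 9.1). This is only a sketch of mine, not part of the original function, and it gives up the ability to pass a whole query expression as tableName, since %I quotes its argument as a single identifier:
CREATE OR REPLACE FUNCTION small_distinct_safe(
    tableName varchar, fieldName varchar, sample anyelement = ''::varchar)
  -- Hypothetical hardened variant of small_distinct: identifiers are quoted
  -- with format()/%I, so tableName must be a plain table name here.
  RETURNS TABLE ( result anyelement ) AS
$BODY$
BEGIN
  EXECUTE format('SELECT %I FROM %I ORDER BY %I LIMIT 1',
                 fieldName, tableName, fieldName) INTO result;
  WHILE result IS NOT NULL LOOP
    RETURN NEXT;
    EXECUTE format('SELECT %I FROM %I WHERE %I > $1 ORDER BY %I LIMIT 1',
                   fieldName, tableName, fieldName, fieldName)
      INTO result USING result;
  END LOOP;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;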
Upvotes: 4
Reputation: 11591
BEGIN;
CREATE TABLE dist ( x INTEGER NOT NULL );
INSERT INTO dist SELECT random()*50 FROM generate_series( 1, 5000000 );
COMMIT;
CREATE INDEX dist_x ON dist(x);
VACUUM ANALYZE dist;
EXPLAIN ANALYZE SELECT DISTINCT x FROM dist;
HashAggregate (cost=84624.00..84624.51 rows=51 width=4) (actual time=1840.141..1840.153 rows=51 loops=1)
-> Seq Scan on dist (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.003..573.819 rows=5000000 loops=1)
Total runtime: 1848.060 ms
PG can't (yet) use an index for DISTINCT (skipping over the identical values), but you can do this:
CREATE OR REPLACE FUNCTION distinct_skip_foo()
RETURNS SETOF INTEGER
LANGUAGE plpgsql STABLE
AS $$
DECLARE
    _x INTEGER;
BEGIN
    -- start from the smallest value, a cheap probe of the index on x
    _x := min(x) FROM dist;
    WHILE _x IS NOT NULL LOOP
        RETURN NEXT _x;
        -- jump to the next larger value, another single index probe
        _x := min(x) FROM dist WHERE x > _x;
    END LOOP;
END;
$$ ;
EXPLAIN ANALYZE SELECT * FROM distinct_skip_foo();
Function Scan on distinct_skip_foo (cost=0.00..260.00 rows=1000 width=4) (actual time=1.629..1.635 rows=51 loops=1)
Total runtime: 1.652 ms
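For comparison, the same skip-the-duplicates idea can be written in plain SQL with a recursive CTE (PostgreSQL 8.4+). This is just a sketch against the dist table above; I'd expect it to do the same series of cheap index probes, but check it with EXPLAIN ANALYZE on your own data:
WITH RECURSIVE skip AS (
    -- smallest value first, then repeatedly the next larger one
    (SELECT x FROM dist ORDER BY x LIMIT 1)
  UNION ALL
    SELECT (SELECT x FROM dist WHERE x > skip.x ORDER BY x LIMIT 1)
    FROM skip
    WHERE skip.x IS NOT NULL
)
SELECT x FROM skip WHERE x IS NOT NULL;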
Upvotes: 15
Reputation: 78561
You're selecting distinct values from the whole table, which automatically leads to a seq scan. You have millions of rows, so it will necessarily be slow.
There's a trick to get the distinct values faster, but it only works when the data has a known (and reasonably small) set of possible values. For instance, I take it that your bundle_id references some kind of bundles table, which is smaller. That means you can write:
select bundles.bundle_id
from bundles
where exists (
select 1 from audit_records
where audit_records.bundle_id = bundles.bundle_id
);
This should lead to a nested loop / seq scan on bundles -> index scan on audit_records using the index on bundle_id.
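The same pattern should work for the server_name column, provided there is some small table (or other cheap source) of candidate names to probe from; here servers is a hypothetical lookup table, not something from your schema:
select servers.server_name
from servers
where exists (
    select 1 from audit_records
    where audit_records.server_name = servers.server_name
);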
Upvotes: 8