Reputation: 25662
consider the following pgSQL statement:
SELECT DISTINCT some_field
FROM some_table
WHERE some_field LIKE 'text%'
LIMIT 10;
Consider also, that some_table consists of several million records, and that some_field has a b-tree index.
Why does the query take so long to execute (several minutes)? What I mean is, why doesnt it loop through creating the result set, and once it gets 10 of them, return the result? It looks like the execution time is the same, regardless of whether or not you include a 'LIMIT 10' or not.
Is this correct or am I missing something? Is there anything I can do to get it to return the first 10 results and 'screw' the rest?
UPDATE: If you drop the distinct, the results are returned virtually instantaneously. I do know however, that many of the some_table records are fairly unique already, and certianly when I run the query without he distinct declaration, the first 10 results are in fact unique. I also eliminated the where clause (eliminating it as a factor). So, my original question still remains, why isnt it terminating as soon as 10 matches are found?
Upvotes: 1
Views: 371
Reputation: 37655
Any time there's an operation that involves aggregation, and "DISTINCT" certainly qualifies, the optimizer is going to do the aggration before even thinking about what's next. And aggration means scanning the whole table (in your case involving a sort, unless there's an index).
But the most likely deal-breaker is that you are grouping on an operation on a column, rather than a plain column value. The optimizer generally disregards a number of possible operations once you are operating on a column transformation of some kind. It's quite possibly not smart enough to know that the ordering of "LIKE 'text%'" and "= 'text'" is the same for grouping purposes.
And remember, you're doing an aggregation on an operation on a column.
Upvotes: 2
Reputation: 5454
Can you post the results of running EXPLAIN on the query? This will reveal what Postgres is doing to execute the query, and is generally the first step in diagnosing query performance problems.
It may be sorting or constructing a hash table of the entire rowset to eliminate the non-distinct records before returning the first row to the LIMIT operator. It makes sense that the engine should be able to read a fraction of the records, returning one new distinct at a time until the LIMIT clause has satisfied its 10 quota, but there may not be an operator implemented to make that work.
Is the some_field unique? If not, it would be useless in locating distinct records. If it is, then the DISTINCT clause would be unnecessary, since that index already guarantees that each row is unique on some_field.
Upvotes: 3
Reputation: 20367
how big is the table? do you have any indexes on the table? check your query execution plan to determine if it's doing a table scan, an index scan, or an index seek. if it's doing a table scan then you most likely dont have any indexes.
try putting an index on the field your filtering by and/or the field your selecting.
Upvotes: 0
Reputation: 35171
You have a DISTINCT. This means that to find 10 distinct rows, it's necessary to scan all rows that match the predicate until 10 different some_fields are found.
Depending on your indices, the query optimizer may decide that scanning all rows is the best way to do this.
10 distinct rows could represent 10, a million, an infinity of non-distinct rows.
Upvotes: 8
Reputation: 112406
I'm suspicious it's because you don't have an ORDER BY. Without ordering, you might have to cruise a whole lot of records to get 10 that meet your criterion.
Upvotes: -1