Two queries with equal results takes vastly different time to execute?

Question

I'm currently using redshift. I was trying to execute a query to calculate a column called id_number (data type INTEGER) from a VARCHAR column called id to speed up further queries using id_number instead of id.

Here is the first query I tried :

select rank() over (order by id) id_number, id, sid1 ,sid2
from table
limit 10000

However, noticing that this query was taking quite some time, I tried the next query:

with A as(
 select id, sid1, sid2 
 from table
 limit 10000
)
select rank() over (order by id) id_number, id, sid1 ,sid2
from A

which was over in a flash.

How was it that the second query took such less time to execute, while the two queries seem to do the exact same thing?

If it is because of the positions of limit 10000, how did the position of limit contribute to the difference in execution time?

Laurenz Albe · Accepted Answer

Your two queries are quite different.

The first one has to sort the complete table to get the rank() and then emits the first 10000 rows of the result (with no particular ordering enforced).

The second one selects 10000 rows (without a particular ordering enforced) and then sorts those to calculate rank() on them.

If the table is significantly larger than 10000 rows, it is unsurprising that the first query, which has to sort it all, is much slower.

Look at the EXPLAIN (ANALYZE, BUFFERS) output to understand this better.

Two queries with equal results takes vastly different time to execute?

Answers (1)

Related Questions