user12562215
user12562215

Reputation: 145

Redshift SQL Optimization - UNION ALL and DISTINCT

Suppose I have a table with a large volume. Which of the following would be more optimized?

SELECT DISTINCT shipping_address_id AS address_id FROM table_A
UNION
SELECT DISTINCT billing_address_id AS address_id FROM table_A

vs.

SELECT DISTINCT address_id FROM
(
SELECT shipping_address_id AS address_id FROM table_A
UNION ALL 
SELECT billing_address_id AS address_id FROM table_A
)

Upvotes: 0

Views: 3073

Answers (3)

deroby
deroby

Reputation: 6002

As with all things SQL, the only true answer is : "It depends."

You have to keep in mind that with (most if not all) SQL implementations, the system will return you the information that you're asking for but you have no real way of saying how it should get you that information. The query optimizer will look at your query, look at the data available (via the statistics), the indexes available, the resources available, etc and come up with a plan on how to fetch you the data in what is hopefully the most efficient way.

As such it may very well be that 2 queries looking quite different will still result in the same query plan being executed in the back. Or that two identical queries will result in totally different plans being executed. (e.g. in test it's like this because you have 100 records, in production it's like that because you have millions of records)

You also need to know that UNION actually is a UNION ALL operation followed by a DISTINCT operation.

So, if you do :

SELECT DISTINCT address_id FROM
(
    SELECT shipping_address_id AS address_id FROM table_A
    UNION ALL 
    SELECT billing_address_id AS address_id FROM table_A
)

this is actually identical to doing

SELECT shipping_address_id AS address_id FROM table_A
UNION 
SELECT billing_address_id AS address_id FROM table_A

You can run your query with 'EXPLAIN ANALYZE' to confirm this.

The question on whether you should 'pre-DISTINCT' the two SELECTs before UNION-ing them together is tricky and may depend on what your data is like. Assuming both fields are indexed then a DISTINCT on them separately may be quite fast and if the resulting (distinct) lists are remarkably smaller than what was there in the first place then the overhead of doing 2 + 1 distinct operations might be faster than running a distinct over the combined list. However, if both starting lists are nearly unique to start with (and maybe not indexed either) then doing 1 big distinct over the entirety may turn out to be faster

Conclusion: there is no 1 answer that fits all situations; the best thing to do is to try both and see what works best. EXPLAIN ANALYZE is your friend. You'll build up an understanding on what goes on in the background and with that experience and your knowledge of the (probably to be expected) data you're working on this should help you decide on how to proceed. Keep in mind that (overly) optimizing for the current situation might bite you in the back in say 1 year when the amount of data may have increased tremendously.

Upvotes: 1

Bill Weiner
Bill Weiner

Reputation: 11032

Depending on what percentage is the unique addresses to total addresses you can find that GROUP BY will perform better. If the percentage is high is often a better way to achieve uniqueness. Try:

SELECT address_id FROM
(
  SELECT shipping_address_id AS address_id FROM table_A
  UNION ALL 
  SELECT billing_address_id AS address_id FROM table_A
)
GROUP BY address_id;

Also UNION (vs UNION ALL) can be slow and running 3 distinct operations would lead me to believe that this would be the worst performing. Let us know which is best, I love seeing real experimental data results.

Upvotes: 1

CleverUser123
CleverUser123

Reputation: 46

I believe you should use:

SELECT DISTINCT shipping_address_id AS address_id FROM table_A
UNION
SELECT DISTINCT billing_address_id AS address_id FROM table_A

This one is naturally easier because it contains less characters (more useful for files) and is more human readable (you want to know what it means).

Upvotes: 1

Related Questions