François M.
François M.

Reputation: 4278

Joins, conditions and speed in SQL

While preparing some requests, I was writing this :

SELECT *
FROM ta A
JOIN tb B
    ON A.col1 = B.col1 
JOIN tc C
    ON B.col2 = C.col2 
WHERE B.col3 = 'whatever'
AND C.col4 = 'whatever2'

And I began to think about the following :

SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE col3 = 'whatever') B
    ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE col4 = 'whatever2') C
    ON B.col2 = C.col2

(If I'm not mistaken, the result would be the same). I'm wondering if it would be significantly faster ? My guess is that it would but I'd be interested in knowing why/why not ?

(Because our server is down at the moment, I can't test it myself right now, so I'm asking here, I hope you won't mind.)

(In case it matters, the engine is Vertica, but my question isn't really specific to Vertica)

Upvotes: 0

Views: 163

Answers (2)

Jim Macaulay
Jim Macaulay

Reputation: 5141

Your first query will work fine, but the second query will not be executed and causes error. The reason behind it is, you are taking JOIN (SELECT * FROM tb WHERE B.col3 = 'whatever') B ON A.col1 = B.col1.

In this condition you are matching the column with A.col1 = B.col1. Here you will get A.col1 from ta table, but you will not get B.col1. While specifying a sub query in the join, you should not use ' * ' operator. Joins will not recognize this operator in a sub query. You need to specify required column names. Like the example in below query,

SELECT *
FROM ta A
JOIN (SELECT col1,col2 FROM tb WHERE B.col3 = 'whatever') B
    ON A.col1 = B.col1
JOIN (SELECT col2 FROM tc WHERE C.col4 = 'whatever2') C
    ON B.col2 = C.col2

This will execute and provides you a result. Two columns is taken in the first join sub query col1,col2, as you are using the condition B.col2 from B table in the second join condition. In a select clause you can provide ' * ' operator which provides you all the columns from all three tables. But you are not supposed to use the operator in a sub query of a join, as joins are coded in such a way.

Both the queries does not have much difference, but your first logic will execute faster compared to the second. In the second logic, two sub queries are used which makes multiple searches in the database and provides you result little slower than the first logic.

Upvotes: 0

woot
woot

Reputation: 7616

Your second query is a little off, it should be:

SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE tb.col3 = 'whatever') B
    ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE tc.col4 = 'whatever2') C
    ON B.col2 = C.col2

Notice the inline view where clauses need to reference the table in scope, not the alias for the view. B and C are out of scope within the inline views.

In any case, because you are doing an inner join, it won't matter from a results perspective because the condition is the same whether it occurs pre-join or post-join.

You can reasonably rely on the optimizer to do the following:

  1. Only materialize the columns required when needed.
  2. Push predicates down where it makes sense

That said, there should be no difference between the two statements. Most likely it is pushing down predicates for the first one to make it more like the second one. If you have statistics gathered, the optimizer should be smart enough to query these the same way (or really close).

That isn't to say I haven't seen what you have in your second query "fix" query issues for me in Vertica... but usually it's only when I am using multiple COUNT(DISTINCT ...) expressions or theta joins, etc.

Now if this were an outer join, then the statements would be different. The first one would apply the filter after the join, the second would be before the join.

Of course, I'll mention that you really just need to do an explain of both methods. Just make sure statistics are gathered.

Hope it helps.

Upvotes: 1

Related Questions