Bruno

Reputation: 4655

Can this simple SQL query be optimized?

I have the following query:

SELECT COUNT(*) 
FROM Address adr INNER JOIN 
     Auditable a on adr.UniqueId = a.UniqueId

The query is taking quite a long time to complete. I feel dumb, but is there any way to optimize it? I want to count all the address entries that have an underlying auditable.

EDIT: All your input is much appreciated; here are some more details:

Upvotes: 7

Views: 420

Answers (8)

Amy B

Reputation: 110071

Since you have two sets of data ordered by the same value, have you tried a merge join instead of the nested loop join?

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT COUNT(*)  
FROM Address adr INNER JOIN  
     Auditable a on adr.UniqueId = a.UniqueId 
OPTION (LOOP JOIN)

SELECT COUNT(*)  
FROM Address adr INNER JOIN  
     Auditable a on adr.UniqueId = a.UniqueId 
OPTION (MERGE JOIN)

SELECT COUNT(*)  
FROM Address adr INNER JOIN  
     Auditable a on adr.UniqueId = a.UniqueId 
OPTION (HASH JOIN)

Edit:

These explanations are conceptual. SQL Server may be doing more sophisticated operations than my examples show. This conceptual understanding, combined with the measurement of time and logical IO from the SET STATISTICS commands and with examination of query execution plans, forms the basis of my query optimization technique (grown over four years). May it serve you as well as it has me.

Setup

  • Get 5 decks of cards.
  • Take 1 deck and produce a parent data set.
  • Take the other 4 decks and produce the child data set.
  • Order each data set by card value.
  • Let m be the number of cards in the parent data set.
  • Let n be the number of cards in the child data set.

NestedLoop

  • Take a card off the top of the parent data set.
  • Search (using binary search) within the child data set for the first occurrence of a match.
  • Seek forward in the child data set from the first match until a non-match is found. You've now found all the matches.
  • Repeat this for each card in the parent data set.

The nested loop algorithm iterates the parent data set, and then searches the child data set once for each parent, making it cost: m * log(n)
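The card-deck steps above can be sketched in Python (a conceptual illustration, not SQL Server's actual implementation): binary-search the sorted child set once per parent key, then scan forward through the matches.

```python
from bisect import bisect_left

def nested_loop_join_count(parent, child_sorted):
    """Count matching pairs: for each parent key, binary-search the
    sorted child set, then scan forward until a non-match (m * log n)."""
    count = 0
    for key in parent:
        i = bisect_left(child_sorted, key)      # binary search: the first occurrence
        while i < len(child_sorted) and child_sorted[i] == key:
            count += 1                          # seek forward until a non-match
            i += 1
    return count

# one "deck" of parent card values, four "decks" of child card values
parent = list(range(1, 14))
child = sorted(list(range(1, 14)) * 4)
print(nested_loop_join_count(parent, child))   # each parent card matches 4 children: 52
```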

Merge

  • Take a card off the top of the parent data set.
  • Take a card off the top of the child data set.
  • If the cards match, pull cards from the top of each deck until a non-match is found from each. Produce each matching pair between the parent and child matches.
  • If the cards do not match, find the smaller between the parent and child cards, and take a card off the top of that data set.

The merge join algorithm iterates the parent data set once and the child data set once, making it cost: m + n. It relies on the data being ordered. If you ask for a merge join on un-ordered data, you will incur an ordering operation! This brings the cost to (m * log(m)) + (n * log(n)) + m + n. Even that might be better than nested loop in some cases.
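The same card-deck steps for the merge join can be sketched in Python (again a conceptual illustration): a single pass over each sorted set, producing every pairing within a run of matching values.

```python
def merge_join_count(parent_sorted, child_sorted):
    """Count matching pairs with one pass over each sorted set (m + n)."""
    count = i = j = 0
    while i < len(parent_sorted) and j < len(child_sorted):
        if parent_sorted[i] == child_sorted[j]:
            # pull all cards with this value from the top of each deck,
            # then produce every matching parent/child pair
            key = parent_sorted[i]
            i2, j2 = i, j
            while i2 < len(parent_sorted) and parent_sorted[i2] == key:
                i2 += 1
            while j2 < len(child_sorted) and child_sorted[j2] == key:
                j2 += 1
            count += (i2 - i) * (j2 - j)
            i, j = i2, j2
        elif parent_sorted[i] < child_sorted[j]:
            i += 1                              # advance the deck with the smaller card
        else:
            j += 1
    return count

parent = sorted(range(1, 14))
child = sorted(list(range(1, 14)) * 4)
print(merge_join_count(parent, child))          # 52
```

Note that the algorithm never looks backward in either set, which is why it depends on the inputs being sorted.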

Hash

  • Get a card table.
  • Take each card from the parent data set and place it on the card table where you can find it (does not have to be anything to do with card value, just has to be convenient for you).
  • Take each card from the child data set, locate its matching parent on the card table and produce the matching pair.

The hash join algorithm iterates the parent data set once and the child data set once, making it cost: m + n. It relies on having a big enough card table to hold the entire contents of the parent data set.
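The hash join steps above can be sketched in Python (a conceptual illustration, using a dict as the "card table"): build the table from the parent set once, then probe it once per child card; no ordering is required.

```python
from collections import defaultdict

def hash_join_count(parent, child):
    """Build a hash table over the parent set, then probe it once per
    child row (m + n); works on unordered input."""
    table = defaultdict(int)                    # the "card table"
    for key in parent:
        table[key] += 1                         # place each parent card where we can find it
    return sum(table[key] for key in child)     # probe with each child card

parent = list(range(1, 14))
child = list(range(1, 14)) * 4
print(hash_join_count(parent, child))           # 52
```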

Upvotes: 11

Will Marcouiller

Reputation: 24132

The EXISTS clause is often less expensive to run than an INNER JOIN.

select COUNT(adr.UniqueId)
    from Addresses adr
    where EXISTS (
        select 1
            from Auditables aud
            where aud.UniqueId = adr.UniqueId
    )

Does this suit your needs?

N.B. GUIDs are very expensive for the database engine.

Upvotes: 1

Stephanie Page

Reputation: 3893

The real issue is the nested loop join. For each of the 1.4 million rows in the Address table you're doing an index seek into the Auditable table. That means 1.4M root-block, branch-block, and leaf-block reads, for a total of 4.2M block reads. The entire index is probably only 5K blocks or so... it should be doing a hash join so it reads both indexes once and hashes through them.
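The arithmetic behind this claim can be checked with a back-of-the-envelope sketch (the row count and 3-level index depth come from the answer above; the 5K-block index size is its stated estimate):

```python
# Block-read comparison: per-row index seeks vs. scanning each index once.
rows = 1_400_000               # rows in the Address table
levels = 3                     # blocks touched per seek: root + branch + leaf
index_blocks = 5_000           # estimated size of the whole index

nested_loop_reads = rows * levels      # one 3-block seek per outer row
hash_join_reads = 2 * index_blocks     # roughly: read both indexes once
print(nested_loop_reads)       # 4200000
print(hash_join_reads)         # 10000
```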

If you think these tables are large, I'm guessing this is on a small box without a lot of memory. You need to make sure that you have enough memory allocated to fit the entire index into memory to make the hash join efficient.

Upvotes: 2

SideFX

Reputation: 839

For large tables such as these, you may wish to partition your data to increase query performance. Also, if you haven't already, try running the Tuning Advisor to see if there are any additional indexes that may be advantageous. In addition, have you reorganized your clustered indexes lately, and is that a task in your maintenance package? Many times, this will greatly improve your performance as well.

Upvotes: 0

TomTom

Reputation: 62093

Missing index on the foreign key, I would say.

  • 1.4 million and 4 million rows are not large tables; they are small. Say large when you get past 500 million entries, please.

  • For a real answer we need the execution plan / query plan, so we can see what happens.

  • And it would be nice to know what "long" is in your world (given that you think 4 million rows are a lot). This query will never complete in 1 second, so what do you expect, and what actually happens?

  • I bet, though, that you have a missing index. Short of that, I would start pointing at the hardware (because I have seen that, too, as the reason for crap performance).

Upvotes: 0

KM.

Reputation: 103589

If you run this query often and it needs to be super fast, create a materialized indexed view of it. There will be a slight overhead on INSERT/UPDATE/DELETEs, but this query will be just about instant. The aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.

Improving Performance with SQL Server 2005 Indexed Views

Upvotes: 6

Chris Shaffer

Reputation: 32575

Is Auditable.UniqueID a foreign key reference to Address.UniqueID, meaning there are no values in Auditable that don't also exist in Address?

If so, this may work and may be faster:

SELECT COUNT(DISTINCT Auditable.UniqueID)
FROM Auditable

Note: This also assumes that UniqueID is unique(/primary key) in the Address table but not unique in the Auditable table

Upvotes: 1

Coding Flow

Reputation: 21881

Not sure if it would be faster, but you could try the following:

SELECT COUNT(adr.UniqueId) 
FROM Address adr INNER JOIN 
     Auditable a on adr.UniqueId = a.UniqueId

It should give you the same count, because UniqueId will never be null.

Upvotes: 0
