Reputation: 28699
I have a poor man's replication setup that I can't do anything about. Some identifying data (basically primary key) from a call_table
is copied into another table via a simple trigger, and then the "replication server" runs a stored procedure to copy the data from the queue table to a #temp table (to prevent locking in SQL 6.5 is the case that was made to me). Finally, a query uses the key data from the temp table to pull data back to the replication server from the call_table
using this query:
/* select the data to return to poor man replication server */
SELECT c.id,
c.date,
c.time,
c.duration,
c.location
FROM #tmp q, call_table c (NOLOCK)
WHERE q.id=c.id
AND q.date=c.date
AND q.time=c.time
AND q.duration=c.duration
AND q.location=c.location
GROUP BY c.id,
c.date,
c.time,
c.duration,
c.location
Once a night the queue table is purged and this starts over. While investigating this, the implicit cross join jumped at me (I'm on the side that they are usually evil), but then I read The power of the Cross Join. I'm here because I'm not quite convinced. Say the temp table has about 10,000 rows for the day, the call_table has about 100,000 for the month so far. How is this query going to work? Does it mash the two tables together for a total of 1,000,000,000 in memory, then use the group clause to trim it back down? Could you explain what steps SQL takes to compile the results?
My Query:
|--Hash Match Root(Aggregate, HASH:([c].[id], [c].[date], [c].[location], [c].[time], [c].[duration]), RESIDUAL:(((((((((((((((((((((([c].[id]=[c].[id] AND [c].[PIN]=[c].[PIN]) AND [c].[ORIG]=[c].[ORIG]) AND [c].[date]=[c].[date]) AND [c].[CTIME]=[c].[CTIME
|--Hash Match Team(Inner Join, HASH:([q].[id], [q].[date], [q].[location], [q].[time], [q].[duration])=([c].[id], [c].[date], [c].[location], [c].[time], [c].[duration]), RESIDUAL:(((([c].[id]=[q].[id] AND [c].[location]=[q].[location]) AND [c].[duration]=[q].[duration]) AND [
|--Table Scan(OBJECT:([db].[dbo].[queue] AS [q]))
|--Table Scan(OBJECT:([db].[dbo].[call_table] AS [c]))
Yours:
|--Merge Join(Right Semi Join, MERGE:([q].[id], [q].[date], [q].[time], [q].[duration], [q].[location])=([c].[id], [c].[date], [c].[time], [c].[duration], [c].[location]), RESIDUAL:(((([q].[id]=[c].[id] AND [q].[location]=[c].[location]) AND [q].[duration]=[c].[duration]) AND [q].[
|--Index Scan(OBJECT:([db].[dbo].[queue].[PK_queue] AS [q]), ORDERED)
|--Sort(ORDER BY:([c].[id] ASC, [c].[date] ASC, [c].[time] ASC, [c].[duration] ASC, [c].[location] ASC))
|--Table Scan(OBJECT:([db].[dbo].[call_table] AS [c]))
Upvotes: 1
Views: 678
Reputation: 425613
The query you described is no way a CROSS JOIN
.
SQL Server
is smart enough to transform the WHERE
condition into the JOIN
's.
However, I see no point in GROUP BY
here.
This query:
SELECT c.id,
c.date,
c.time,
c.duration,
c.location
FROM #tmp q, call_table c (NOLOCK)
WHERE q.id=c.id
AND q.date=c.date
AND q.time=c.time
AND q.duration=c.duration
AND q.location=c.location
GROUP BY c.id,
c.date,
c.time,
c.duration,
c.location
can be easilty rewritten as
SELECT c.id,
c.date,
c.time,
c.duration,
c.location
FROM call_table c (NOLOCK)
WHERE EXISTS
(
SELECT NULL
FROM #tmp q
WHERE q.id = c.id
AND q.date = c.date
AND q.time = c.time
AND q.duration = c.duration
AND q.location = c.location
)
, provided that c.id
is a PRIMARY KEY
.
If it's not, just add DISTINCT
to SELECT
above.
Update:
From your plan I see that that your query uses HASH JOIN
, while my uses MERGE SEMI JOIN
.
The latter one is usually more efficient if you have an ordered set, but for some reason the query does not use the composite index you created, but instead performs full table scan.
This is strange, since all your values are contained within the index.
Probably (probably) this is because your fields allow NULL
's.
Make sure that you use only the fields from the composite index both in WHERE
conditions and in SELECT
clause and, if possible, make them NOT NULL
.
This should make your query use preordered resultsets in MERGE SEMI JOIN
. You can tell it if you see neither TABLE SCAN
nor SORT
in the plan, just two INDEX SCAN
's.
And two more questions:
c.id
a PRIMARY KEY
on call_table
?q.id
a PRIMARY KEY
on #tmp
?If answer to both questions is yes
, then you will benefit from doing two things:
PRIMARY KEY
as CLUSTERED
on both tablesRewriting your query as this:
SELECT c.id,
c.date,
c.time,
c.duration,
c.location
FROM call_table c (NOLOCK)
JOIN #tmp q
ON q.id = c.id
AND q.date = c.date
AND q.time = c.time
AND q.duration = c.duration
AND q.location = c.location
Upvotes: 1
Reputation: 238176
How is this query going to work? Does it mash the two tables together for a total of 1,000,000,000 in memory, then use the group clause to trim it back down? Could you explain what steps SQL takes to compile the results?
It might go something like this. Say Sql Server decides to use a hash join. It creates an in-memory hash table of #temp, with a hash based on id, date, time, duration, and location. Then it iterates over the rows in call_table. For each row, it uses the hash table to detect if a matching row exists. If it does, the row gets added to the result table. So no 1,000,000,000 rows are ever in memory.
Another option (maybe even better here) is to iterate over one table, and use the id column to do an index lookup on the other table. This requires even less memory (although it would be very beneficial if the index were in cache.)
You can see what Sql Server really does by reading the execution plan. You can enable the execution plan under the Query menu.
Upvotes: 0