nick
nick

Reputation: 3458

Why does TOP or SET ROWCOUNT make my query so slow?

I have a SQL Server 2008 database with approximately 14 millions rows. In it there are two tables

Table1
    rowId int, primary key
    someData1 int
    someData2 int...

Table2
    id int, primary key
    rowId ==> int, refers to the rowId from Table1
    someCalculatedData int...

    Table2.rowId is not a foreign key, but I did make a Non-clustered, Non-Unique index on it

When I import the data I insert all the data into Table1 and then find the rows that don't have an entry in Table2 and insert data into this table.

I was doing this in batches by selecting 250,000 rows at a time, processing the data and inserting them into Table2, and then finding the next 250,000 rows and so on until they were no rows in Table1 that don't have any entry in Table2.

select TOP 250000 rowId from Table1 
       where NOT EXISTS (select rowId from Table2 where Table1.rowId=Table2.rowId)

This query has really slowed down and now it takes over two minutes to get the next batch. If I run the query without the TOP or SET ROWCOUNT keywords and get all of the rows, then the query returns the results in ~15 seconds.

Does anyone know why TOP and SET ROWCOUNT cause the query to take so much longer then getting all of the data?

Can I improve the performance of the query and still only get a subset of the data each time?

Upvotes: 4

Views: 3100

Answers (3)

ps.
ps.

Reputation: 4360

see if this helps

select top 250000 t1.rowid 
from Table1 t1 
left outer join table2  t2 
on t1.rowid=t2.rowid 
where t2.rowid is null

Upvotes: 3

nick
nick

Reputation: 3458

Apparently my index or the index statistics on Table2.rowId weren't fresh and the query plan for the TOP/SET ROWCOUNT queries to perform poorly.

I reorganized/rebuilt the index and the query performance for the TOP/SET ROWCOUNT queries was greatly improved.

Upvotes: 1

Sonny Boy
Sonny Boy

Reputation: 8016

This occurs because all of the query needs to run on the server before the server can decide which the "top" 250,000 rows are. Only then will the data start coming back to your client over the network.

Without the "top" statement the server will start sending data immediately, though I suspect the total time for all the data to be sent from the server and received by your machine is likely to be similar to the statement including the top.

Upvotes: 1

Related Questions