Reputation: 1739
I have a table with close to 30 million records. Just several columns. One of the column 'Born'
have not more than 30 different values and there is an index defined on it. I need to be able to filter on that column and efficiently page through results.
For now I have (example if the year I'm searching for is '1970' - it is a parameter in my stored procedure):
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Every query of that sort (only Born
parameter used) returns just over 1 million results.
I've noticed the biggest overhead is on the count used to return the total results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople
from the select clause the whole thing speeds up a lot.
Is there a way to speed up the count in that query. What I care about is to have the paged results returned and the total count.
Upvotes: 8
Views: 22371
Reputation: 2880
Here is a novel approach using system dmv's if you can get by with a "good enough" count, you don't mind creating an index for every distinct value for [Born], and you don't mind feeling a little bit dirty inside.
Create a filtered index for each year:
--pick a column to index, it doesn't matter which.
CREATE INDEX IX_Person_filt_1970 on Person ( id ) WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id ) WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id ) WHERE Born = '1972'
Then use the [rows] column from sys.partitions to to get a rowcount.
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *,
(
SELECT sum(rows)
FROM sys.partitions p
inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id
inner join sys.tables t on t.object_id = i.object_id
WHERE t.name ='Person'
and i.name = 'IX_Person_filt_' + '1970' --or at @p1
) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close) This approach won't work if you need to filter on anything but [Born]
Upvotes: 1
Reputation: 155632
Updated following discussion in comments
The cause of the problem here is very low cardinality of the IX_Person_Born
index.
SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.
You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.
This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.
Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.
A high cardinality index (say for full date of birth), rather than year would be much quicker.
This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries
because index-trees don't help much.
In short: there's no significantly quicker way to get the count using T-SQL and indexes.
You have a couple of options:
Either OLAP/Cube rollups or do it yourself:
select Born, count(*)
from Person
group by Born
The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.
Split into two queries:
SELECT count(*)
FROM Person
WHERE Born = '1970'
SELECT TOP 30 *
FROM Person
WHERE Born = '1970'
Then run these either in parallel server side, or add it to the user interface.
This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person
table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.
At this point a technology change is probably out, but it may be worth investigating so I've included it.
I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging then I'd probably go with option 2 and AJAX call to get the count.
Upvotes: 11
Reputation: 110071
DECLARE @TotalPeople int
--does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.
One of the column 'Born' have not more than 30 different values and there is an index defined on it.
Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
Here is a desperate measure that will force Sql's hand (at the potential cost of making writes very expensive - measure that, and blocking schema changes to the Person table while the view exists).
CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born
GO
CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)
By putting a clustered index on a view, you make it a system maintained copy. The size of this copy is much smaller than 30 Million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
Upvotes: 2
Reputation: 2245
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, **max(Row) AS TotalPeople**
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
why not like that ?
edit , dont know why bold doesnt work :<
Upvotes: 1