Reputation: 2720
I have a data load scenario where I build a dynamic SQL query to pull data and cache it in our service. There is one table that contains all product data: ProductHistory (47 columns, 200,000+ records, and it will keep growing).
What I need: Get the latest products by using the maximum id, maximum version and maximum changeid.
First Attempt:
SELECT DISTINCT product.*
FROM ProductHistory product
WHERE product.version =
      (SELECT MAX(version) FROM ProductHistory p2
       WHERE product.Id = p2.Id
         AND product.changeId =
             (SELECT MAX(changeid) FROM ProductHistory p3
              WHERE p2.changeId = p3.changeId))
This took more than 2.51 minutes.
Other Failed Attempt:
select distinct product.* from ProductHistory product
where CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal) =
(select MAX(CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal)) from ProductHistory p2
where product.Id = p2.Id)
It uses the same principle as ordering dates by concatenating the components in order of significance.
For example, 11 Jun 2007 = 20070611
And in our case: Id = 4, version = 127, changeid = 32 => 40127032
The zeros are there as separators so the 3 different ids don't run into each other.
But this one takes 3.10 minutes !!! :(
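As a side note, concatenation like this only orders correctly if each component is zero-padded to a fixed width; without padding, a longer changeid can make a lower-version key numerically larger. A quick sketch of the problem (hypothetical values, not from the real table):

```python
def concat_key(pid, version, changeid):
    # The scheme above: concatenate the three numbers with '0' separators,
    # then compare the result numerically (no fixed-width padding).
    return int(f"{pid}0{version}0{changeid}")

# Version 10 should outrank version 9 for the same id, but a large
# changeid makes the version-9 key longer and therefore numerically bigger:
print(concat_key(1, 9, 100))   # 1090100
print(concat_key(1, 10, 1))    # 101001  -> ranks *below* the version-9 row
```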
So, I basically need a way to make my first query faster. I was also wondering: with this amount of data, is this the best retrieval speed I should expect?
I ran sp_helpindex ProductHistory and found the following index:
PK_ProductHistoryNew - clustered, unique, primary key located on PRIMARY - Id, Version
I wrapped the first query in a stored procedure, but there was no change.
So, wondering by what other means we can improve the performance of this operation ?
Thanks, Mani
P.S.: I am just running these queries in SQL Server Management Studio to see the timings.
Upvotes: 8
Views: 2015
Reputation: 5439
Try this CTE; it should be very fast, and you may not even need extra indexes to get good speed:
with mysuperfastcte as (
    select product.*,
           row_number() over (partition by id order by version desc) as versionorder,
           row_number() over (partition by id order by changeid desc) as changeorder
    from ProductHistory as product
)
select *
from mysuperfastcte
where versionorder = 1
  and changeorder = 1;
NB. I think you may have a bug at this point in your code, so please double-check that the results from my code match what you expect:
and product.changeId = (SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))
Also, obviously, reduce the columns you return to just those you need. Then run the following before your query and check the Messages output:
SET STATISTICS IO ON
Look for tables with high logical reads and figure out where an index will help you.
Hint: If my code works for you then depending on the columns you need you could do:
create index ix1 on ProductHistory (id, version desc) include (changeid, ....);
I hope this helps!
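For anyone who wants to try this pattern outside SQL Server, here is a minimal, runnable sketch of the same window-function idea using Python's sqlite3 (window functions require SQLite 3.25+); the table contents are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductHistory (Id INTEGER, Version INTEGER, ChangeId INTEGER, Name TEXT);
    INSERT INTO ProductHistory VALUES
        (1, 1, 10, 'widget v1'),
        (1, 2, 20, 'widget v2'),   -- latest row for id 1
        (2, 1,  5, 'gadget v1');   -- latest (only) row for id 2
""")

# Rank each product's history rows, then keep the row ranked first
# by both version and changeid.
rows = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Version DESC) AS versionorder,
               ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ChangeId DESC) AS changeorder
        FROM ProductHistory
    )
    SELECT Id, Version, ChangeId, Name
    FROM ranked
    WHERE versionorder = 1 AND changeorder = 1
    ORDER BY Id
""").fetchall()
print(rows)  # [(1, 2, 20, 'widget v2'), (2, 1, 5, 'gadget v1')]
```

Note that a row is returned only when the same row is first by both orderings, which matches this answer's logic but is worth verifying against real data.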
Upvotes: 0
Reputation: 99
I have a feeling this query will take longer as the number of rows increases, but it's worth a shot:
SELECT * FROM
(
    SELECT Col1, Col2, Col3,
           ROW_NUMBER() OVER (PARTITION BY ProductHistory.Id ORDER BY Version DESC, ChangeID DESC) AS RowNumber
    FROM ProductHistory
) AS numbered
WHERE RowNumber = 1
Upvotes: 0
Reputation: 1514
This is getting a bit funky, but I wonder if partitioning would work:
SELECT Id
FROM (
    SELECT Id, version,
           MAX(version) OVER (PARTITION BY changeId) AS max_version
    FROM ProductHistory
) s
WHERE s.version = s.max_version
Upvotes: 0
Reputation: 115510
I think you need an index on (Id, changeId, version)
for this query. Please provide the table definition, the indexes on the table now and the query plan for your query.
Upvotes: 0
Reputation: 65147
Some things I see:

- Is the DISTINCT necessary? If you do a DISTINCT *, it's unlikely to have any benefit, but it will add overhead to check for duplicates across all fields.
- Instead of correlated subqueries in the WHERE clause, JOIN to a derived table. That way the aggregation should process only once; I suspect your WHERE clause is being processed multiple times.
SELECT product.*
FROM ProductHistory product
INNER JOIN ( SELECT p.Id,
                    MAX(p.Version) AS [MaxVer],
                    MAX(p.ChangeId) AS [MaxChange]
             FROM ProductHistory p
             GROUP BY p.Id ) SubQ
    ON  SubQ.Id = product.Id
    AND SubQ.MaxChange = product.ChangeId
    AND SubQ.MaxVer = product.Version
You should also have an index on (Id, Version, ChangeID) for this.
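A minimal, runnable sketch of this derived-table approach, using Python's sqlite3 with made-up data. One caveat worth checking against your data: this variant returns a product only when its maximum Version and maximum ChangeId occur on the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductHistory (Id INTEGER, Version INTEGER, ChangeId INTEGER, Name TEXT);
    INSERT INTO ProductHistory VALUES
        (1, 1, 10, 'widget v1'),
        (1, 2, 20, 'widget v2'),
        (2, 1,  5, 'gadget v1');
""")

# Compute per-product maxima once in a derived table, then join back
# to the history to pick the matching rows.
rows = conn.execute("""
    SELECT ph.Id, ph.Version, ph.ChangeId, ph.Name
    FROM ProductHistory ph
    INNER JOIN (SELECT Id,
                       MAX(Version)  AS MaxVer,
                       MAX(ChangeId) AS MaxChange
                FROM ProductHistory
                GROUP BY Id) sub
        ON  sub.Id = ph.Id
        AND sub.MaxVer = ph.Version
        AND sub.MaxChange = ph.ChangeId
    ORDER BY ph.Id
""").fetchall()
print(rows)  # [(1, 2, 20, 'widget v2'), (2, 1, 5, 'gadget v1')]
```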
Upvotes: 4
Reputation: 185
Well, storing everything in one table is not the way to go. It is better to store the latest version in one table and use another (with the same structure) for the history, since I guess you are more interested in current products than old ones. Design issues like this tend to create many workarounds.
Also, do not use DISTINCT, because it often hides issues in the query (usually, if duplicates are retrieved, it means you could optimize further).
Now, the best part: how to resolve your problem? I guess you should use the grouping principle giving something like this:
SELECT max(id), max(version), max(changeid)
FROM ProductHistory p
WHERE <filter if necessary for old products or anything else>
GROUP BY version, changeid
HAVING version = max(version)
AND changeid = max(changeid)
AND id = max(id)
But if I look at your PK, I'm surprised: the changeid should not be relevant, as you should be dealing with the id and version only...
I am not sure my query is fully correct because I cannot test it, but I guess you can do some testing.
Upvotes: 1
Reputation: 6463
Generally speaking, SELECT MAX() needs to scan through the whole table, and you are doing it twice.
SELECT TOP 1 is much faster, but you need to make sure your index is right and you have a correct ORDER BY. See if you can play with that.
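A sketch of the idea with made-up data, using SQLite's LIMIT 1 (the equivalent of T-SQL's SELECT TOP 1) from Python: with an index matching the ORDER BY, the engine can read just the first row instead of aggregating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductHistory (Id INTEGER, Version INTEGER, ChangeId INTEGER);
    -- An index matching the ORDER BY lets the top row be read directly.
    CREATE INDEX ix_latest ON ProductHistory (Id, Version DESC, ChangeId DESC);
    INSERT INTO ProductHistory VALUES (1, 1, 10), (1, 2, 20), (2, 1, 5);
""")

# Latest row for one product: order by the ranking columns and stop at one row.
row = conn.execute("""
    SELECT Id, Version, ChangeId
    FROM ProductHistory
    WHERE Id = ?
    ORDER BY Version DESC, ChangeId DESC
    LIMIT 1
""", (1,)).fetchone()
print(row)  # (1, 2, 20)
```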
Upvotes: -2
Reputation: 69250
Run the query from SQL Server Management Studio and look at the query plan to see where the bottleneck is. Anywhere you see a "table scan" or "index scan", the engine has to go through all the data to find what it is looking for. If you create appropriate indexes that can be used for these operations, it should improve performance.
Upvotes: 6