Reputation: 407
I have a very large table containing price history.
CREATE TABLE [dbo].[SupplierPurchasePrice](
[SupplierPurchasePriceId] [int] IDENTITY(1,1) PRIMARY KEY,
[ExternalSupplierPurchasePriceId] [varchar](20) NULL,
[ProductId] [int] NOT NULL,
[SupplierId] [int] NOT NULL,
[Price] [money] NOT NULL,
[PreviousPrice] [money] NULL,
[SupplierPurchasePriceDate] [date] NOT NULL,
[Created] [datetime] NULL,
[Modified] [datetime] NULL,
)
Per Product(Id) and Supplier(Id) I have hundreds of price records. Now there is a need to remove the bulk of the data but still keep some historic data. For each Product(Id) and Supplier(Id) i want to keep, let's say, 14 records. But not the first or last 14. I want to keep the first and last record. And then keep 12 records evenly in between the first and last. That way I keep some history in tact.
I cannot figure out a way to do this directly with a stored procedure instead of through my c# ORM (which is way too slow).
Upvotes: 3
Views: 122
Reputation: 107317
(Caveat - I read your question incorrectly. This retains the first, last and every 14th record in between, not necessarily a total of 14 records.)
After backing up your database, and trying a dry run this first to check that this will delete the required data (Assumed here is that the chronological order of records is set by SupplierPurchasePriceDate
- change the ORDER BY
as appropriate if not the case)
WITH CTE AS
(
SELECT
[SupplierId],
[ProductId],
[SupplierPurchasePriceId],
[SupplierPurchasePriceDate],
ROW_NUMBER() OVER (Partition BY [SupplierId], [ProductId]
ORDER BY [SupplierPurchasePriceDate]) -1 AS Rnk
FROM [dbo].[SupplierPurchasePrice]
)
SELECT cteRank.*
FROM
CTE cteRank
JOIN
(SELECT ProductId, SupplierId, MAX(Rnk) as MaxRnk
FROM CTE cteMax
GROUP BY ProductId, SupplierID) X
ON cteRank.SupplierId = X.SupplierId AND cteRank.ProductId = X.ProductId
WHERE cteRank.Rnk % 12 != 0 AND cteRank.Rnk != X.MaxRnk;
If this works as expected, and since it seems you already have a surrogate key, then the delete step is simply deleting the matched surrogate keys:
DELETE FROM [dbo].[SupplierPurchasePrice]
WHERE [SupplierPurchasePriceId] IN (...)
It works by ranking the price data by date and then deleting records which aren't modulo 12 (Change as per your requirement). The -1 is because ROW_NUMBER()
is 1 based. The first record is retained because as the first Modulo match. The MaxRnk
step is to also preserve the latest price for each Supplier
, Product
pair.
Note that this will retain the first, every 12th thereafter, and the last. So there will likely be an uneven gap between the latest (Max) and second last record (the last Modulo = 0). But surely, close enough?
Upvotes: 1
Reputation: 1270391
Here is a direct counting approach to solving the problem:
select spp.*
from (select spp.*,
sum(12.5 / (cnt - 1)) over (partition by SupplierId, ProductId
order by SupplierPurchasePriceId
) as cum
from (select spp.*,
row_number() over (partition by SupplierId, ProductId
order by SupplierPurchasePriceId
) as seqnum,
count(*) over (partition by SupplierId, ProductId) as cnt,
from SupplierPurchasePrice spp
) spp
) spp
where seqnum = 1 or seqnum = cnt or cnt <= 14 or
(floor(cumgap) <> floor(cumgap - 12.5/(cnt - 1)));
The challenge is deciding where the 12 records in between go. This calculates an average "gap" in the records, as 12.5/(cnt - 1)
. This is a constant that is then accumulated over the records. It will go from basically 0 to 12.5 in the largest record. The idea is to grab any record where this passes an integer value. So, if the cumulative goes form 2.1 to 2.3, then the record is not chosen. If it goes from 2.9 to 3.1 then the record is chosen.
The number 12.5 is not magic. Any number between 12 and 13 should do. Except for the issue with choosing the oldest and most recent values. I chose 12.5 to be extra sure that these don't count for the 12.
You can see the same logic working here at SQL Fiddle. The flag column shows which would be chosen and the totflag validates that exactly 14 are chosen.
Upvotes: 3
Reputation: 78183
I would try something like
select
to_keep.SupplierPurchasePriceId
from
(select
foo.SupplierPurchasePriceId,
row_number() over (partition by ProductId, SupplierId, tile_num order by Created) as takeme
from
(select
SupplierPurchasePriceId,
ProductId,
SupplierId,
Created,
ntile(13) over(partition by ProductId, SupplierId order by Created) as tile_num
from
SupplierPurchasePrice
) foo
) to_keep
where
to_keep.takeme = 1
union
select distinct
last_value(SupplierPurchasePriceId) over (partition by ProductId, SupplierId order by Created range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) as SupplierPurchasePriceId
from
SupplierPurchasePrice
This should give primary keys of rows to be kept. Performance may vary. Not tested.
Upvotes: 1