Reputation: 141
I am trying to find a way to accomplish this without using a cursor.
I have a set of data that totals about 13 million records. The interval between one record and the next varies, but are all between 5 and 20 minutes. I need to create a new table of data, but select the data so that this is at least a 30 minute gap between one record and the next.
For example, if I have this:
VID | Datetime
1 | 2016-01-01 00:00
1 | 2016-01-01 00:10
1 | 2016-01-01 00:12
1 | 2016-01-01 00:25
2 | 2016-01-01 00:40
4 | 2016-01-01 01:00
4 | 2016-01-01 02:13
6 | 2016-01-01 02:23
7 | 2016-01-01 02:25
8 | 2016-01-01 02:49
9 | 2016-01-01 02:59
9 | 2016-01-01 03:01
9 | 2016-01-01 03:09
9 | 2016-01-01 03:24
9 | 2016-01-01 04:05
The new table would look like this:
VID | Datetime
1 | 2016-01-01 00:00
2 | 2016-01-01 00:40
4 | 2016-01-01 02:13
8 | 2016-01-01 02:49
9 | 2016-01-01 03:24
I can do this with a cursor, but for millions of records that's crazy. I've seen mentions of something called a quirky update for similar situations, but I'm not sure what that is.
Currently using SQL Server 2014. Any help would be greatly appreciated.
Upvotes: 2
Views: 185
Reputation: 26
Things would be a lot easier if you had an identifier column instead of [vid]. If that was the case then you could do this:
with mycte (id, mydate, keepthis, offset)
as
(
select
id,
mydate,
1 keepthis,
0 offset
from mytable where id = 1
union all
select
t.id,
t.mydate,
case when datediff(mi, o.mydate, t.mydate)+o.offset >= 30 then 1 else 0 end keepthis,
case when datediff(mi, o.mydate, t.mydate)+o.offset >= 30 then 0 else datediff(mi, o.mydate, t.mydate)+o.offset end
from mytable t join mycte o on t.id = o.id+1
)
select id,mydate from mycte where keepthis=1
Upvotes: 1
Reputation: 2235
Thinking a little outside the box, if it is permissible to just get the earliest row in each 30 min interval then I have a solution that works.
Caveats:
--
--setup data
declare @t table (VID int, [Datetime] datetime);
insert @t values
(1, '1986-01-01 00:00'), --very early year
(1, '2016-01-01 00:10'),
(1, '2016-01-01 00:12'),
(1, '2016-01-01 00:25'),
(2, '2016-01-01 00:40'),
(4, '2016-01-01 01:00'),
(4, '2016-01-01 02:13'),
(6, '2016-01-01 02:23'),
(7, '2016-01-01 02:25'),
(8, '2016-01-01 02:49'),
(9, '2016-01-01 02:59'),
(9, '2016-01-01 03:01'),
(9, '2016-01-01 03:09'),
(9, '2016-01-01 03:24'),
(9, '2016-01-01 04:05');
select * from @t order by VID, [Datetime];
--select datediff(MI, (select min([Datetime]) from @t), (select max([Datetime]) from @t)); --15778325 records in 30 years - handled by t4 x t4 x t4 in tally generator
-- Tally generator courtesy of http://www.sqlservercentral.com/blogs/never_say_never/2010/03/19/tally_2D00_table_2D00_cte/
-- Tally Table CTE script (SQL 2005+ only)
-- You can use this to create many different numbers of rows... for example:
-- You could use a 3 way cross join (t3 x, t3 y, t3 z) instead of just 2 way to generate a different number of rows.
-- The # of rows this would generate for each is noted in the X3 comment column below.
-- For most common usage, I find t3 or t4 to be enough, so that is what is coded here.
-- If you use t3 in ‘Tally’, you can delete t4 and t5.
; WITH
-- Tally table Gen Tally Rows: X2 X3
t1 AS (SELECT 1 N UNION ALL SELECT 1 N), -- 4 , 8
t2 AS (SELECT 1 N FROM t1 x, t1 y), -- 16 , 64
t3 AS (SELECT 1 N FROM t2 x, t2 y), -- 256 , 4096
t4 AS (SELECT 1 N FROM t3 x, t3 y), -- 65536 , 16,777,216
t5 AS (SELECT 1 N FROM t4 x, t4 y), -- 4,294,967,296, A lot
Tally AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) N
FROM t4 x, t4 y, t4 z), -- Change the t3's to one of the other numbers above for more/less rows
--generate time values
Intervals as (
select t.N - 1 interval,
dateadd(mi, (t.N - 1) * 30, min_date.min_date) interval_start,
dateadd(mi, (t.N) * 30, min_date.min_date) next_interval_start
from (
select min([Datetime]) min_date
from @t
) min_date
join Tally t
on t.N <= datediff(MI, (select min([Datetime]) from @t), (select max([Datetime]) from @t)) / 30 + 1
),
--join intervals to data tables
Intervaled_data as (
select *, row_number() over (partition by i.interval order by t.[Datetime]) row_num
from @t t
join Intervals i
on t.[Datetime] >= i.interval_start and t.[Datetime] < i.next_interval_start
)
select i.VID, i.[Datetime]
from Intervaled_data i
where i.row_num = 1
order by i.VID, i.[Datetime];
Upvotes: 1