Reputation: 19612
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan. The sample data posted below is all for the same ID so I didn't list it.
To make things a bit clearer, take a look at the sample data for 03.06.2009:
The following timespans are overlapping or contiunous and need to merge into one timespan
The resulting timespan would be from 05:54:48 to 10:00:13. Since there's a gap between 10:00:13 and 10:12:50 we also have the following timespans:
which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
Below you will find the sample data and the flattened data as I would need it. The duration column is just informative.
Any solution - be it SQL or not - is appreciated.
EDIT: Since there are lots of different and interesting solutions I'm refining my original question by adding constraints to see the "best" (if there is one) solution bubble up:
Within these constraints, what would be the best solution? I'm afraid that most of the solutions will be horribly slow since they join on the combination of date and time, which is not an index field in my case.
Would you do all the merging on the client or the server side? Would you first create an optimized temp table and use one of the proposed solutions with that table? I didn't have the time to test the solutions until now but I will keep you informed what works best for me.
Sample data:
Date | Start | Stop
-----------+----------+---------
02.06.2009 | 05:55:28 | 09:58:27
02.06.2009 | 10:15:19 | 13:58:24
02.06.2009 | 13:58:24 | 13:58:43
03.06.2009 | 05:54:48 | 10:00:13
03.06.2009 | 09:26:45 | 09:59:40
03.06.2009 | 10:12:50 | 10:27:25
03.06.2009 | 10:13:12 | 11:14:56
03.06.2009 | 10:27:25 | 10:27:31
03.06.2009 | 10:27:39 | 13:53:38
03.06.2009 | 11:14:56 | 11:15:03
03.06.2009 | 11:15:30 | 14:02:14
03.06.2009 | 13:53:38 | 13:53:43
03.06.2009 | 14:02:14 | 14:02:31
04.06.2009 | 05:48:27 | 09:58:59
04.06.2009 | 06:00:00 | 09:59:07
04.06.2009 | 10:15:52 | 13:54:52
04.06.2009 | 10:16:01 | 13:24:20
04.06.2009 | 13:24:20 | 13:24:24
04.06.2009 | 13:24:32 | 14:00:39
04.06.2009 | 13:54:52 | 13:54:58
04.06.2009 | 14:00:39 | 14:00:49
05.06.2009 | 05:53:58 | 09:59:12
05.06.2009 | 10:16:05 | 13:59:08
05.06.2009 | 13:59:08 | 13:59:16
06.06.2009 | 06:04:00 | 10:00:00
06.06.2009 | 10:16:54 | 10:18:40
06.06.2009 | 10:18:40 | 10:18:45
06.06.2009 | 10:23:00 | 13:57:00
06.06.2009 | 10:23:48 | 13:57:54
06.06.2009 | 13:57:21 | 13:57:38
06.06.2009 | 13:57:54 | 13:57:58
07.06.2009 | 21:59:30 | 01:58:49
07.06.2009 | 22:12:16 | 01:58:39
07.06.2009 | 22:12:25 | 01:58:28
08.06.2009 | 02:10:33 | 05:56:11
08.06.2009 | 02:10:43 | 05:56:23
08.06.2009 | 02:10:49 | 05:55:59
08.06.2009 | 05:55:59 | 05:56:01
08.06.2009 | 05:56:11 | 05:56:14
08.06.2009 | 05:56:23 | 05:56:27
Flattened result:
Date | Start | Stop | Duration
-----------+----------+----------+---------
02.06.2009 | 05:55:28 | 09:58:27 | 04:02:59
02.06.2009 | 10:15:19 | 13:58:43 | 03:43:24
03.06.2009 | 05:54:48 | 10:00:13 | 04:05:25
03.06.2009 | 10:12:50 | 14:02:31 | 03:49:41
04.06.2009 | 05:48:27 | 09:59:07 | 04:10:40
04.06.2009 | 10:15:52 | 14:00:49 | 03:44:58
05.06.2009 | 05:53:58 | 09:59:12 | 04:05:14
05.06.2009 | 10:16:05 | 13:59:16 | 03:43:11
06.06.2009 | 06:04:00 | 10:00:00 | 03:56:00
06.06.2009 | 10:16:54 | 10:18:45 | 00:01:51
06.06.2009 | 10:23:00 | 13:57:58 | 03:34:58
07.06.2009 | 21:59:30 | 01:58:49 | 03:59:19
08.06.2009 | 02:10:33 | 05:56:27 | 03:45:54
Upvotes: 15
Views: 3965
Reputation: 11
I needed to do something similar and wanted a solution that made sense to me:
I start by creating a table of the "smallest time intervals", each one being the interval between successive time points (regardless of whether the pairs of time points are contained within a single time span or not):
create table Time_Pairs
as
select d1,
lead(d1) over (order by d1) as d2
from ( -- all time points
select d1
from (select d1 from Test_Time_Spans union all select d2 from Test_Time_Spans)
group by 1
)
I then keep only the pairs which are within a single time span, and flag whether the small time span is contiguous with the previous and next time span:
create table Non_Overlapping_Time_Spans
as
select s.d1,
s.d2,
case
when s.d1=lag(s.d2) over (order by s.d1) then 1 -- prev time interval is contiguous with this one
else 0
end as merge_prev,
case
when s.d2=lead(s.d1) over (order by s.d1) then 1 -- next time interval is contiguous with this one
else 0
end as merge_next
from Test_Time_Spans t inner join
Time_Pairs s
on s.d1>=t.d1
and s.d2<=t.d2
group by 1,2
Finally, I keep only the start time and end time of each block by picking only the times where merge_prev or merge_next are equal to 0:
select distinct -- need to de-dupe
case
when merge_prev=0 then d1
else lag(d1) over (order by d1)
end as d1,
case
when merge_next=0 then d2
else lead(d2) over (order by d1)
end as d2
from Non_Overlapping_Time_Spans
where merge_prev=0 or merge_next=0
order by 1
Upvotes: 0
Reputation: 2747
Extending on MahlerFive answer I wrote a swift extension to DateTools. So far it has passed all my tests.
extension DTTimePeriodCollection {
func flatten() {
self.sortByStartAscending()
guard let periods = self.periods() else { return }
if periods.count < 1 { return }
var flattenedPeriods = [DTTimePeriod]()
let flatdate = DTTimePeriod()
for period in periods {
guard let periodStart = period.StartDate, let periodEnd = period.EndDate else { continue }
if !flatdate.hasStartDate() { flatdate.StartDate = periodStart }
if !flatdate.hasEndDate() { flatdate.EndDate = periodEnd }
if periodStart.isEarlierThanOrEqualTo(flatdate.EndDate) && periodEnd.isGreaterThanOrEqualTo(flatdate.EndDate) {
flatdate.EndDate = periodEnd
} else {
flattenedPeriods.append(flatdate.copy())
flatdate.StartDate = periodStart
flatdate.EndDate = periodEnd
}
}
flattenedPeriods.append(flatdate.copy())
// delete all periods
for var i = 0 ; i < periods.count ; i++ { self.removeTimePeriodAtIndex(0) }
// add flattened periods to self
for flat in flattenedPeriods { self.addTimePeriod(flat) }
}
Upvotes: 0
Reputation: 425833
In MySQL
:
SELECT grouper, MIN(start) AS group_start, MAX(end) AS group_end
FROM (
SELECT start,
end,
@r := @r + (@edate < start) AS grouper,
@edate := GREATEST(end, CAST(@edate AS DATETIME))
FROM (
SELECT @r := 0,
@edate := CAST('0000-01-01' AS DATETIME)
) vars,
(
SELECT rn_date + INTERVAL TIME_TO_SEC(rn_start) SECOND AS start,
rn_date + INTERVAL TIME_TO_SEC(rn_end) SECOND + INTERVAL (rn_start > rn_end) DAY AS end
FROM t_ranges
) q
ORDER BY
start
) q
GROUP BY
grouper
ORDER BY
group_start
Same decision for SQL Server
is described in the following article in my blog:
Here's the function to do this:
DROP FUNCTION fn_spans
GO
CREATE FUNCTION fn_spans(@p_from DATETIME, @p_till DATETIME)
RETURNS @t TABLE
(
q_start DATETIME NOT NULL,
q_end DATETIME NOT NULL
)
AS
BEGIN
DECLARE @qs DATETIME
DECLARE @qe DATETIME
DECLARE @ms DATETIME
DECLARE @me DATETIME
DECLARE cr_span CURSOR FAST_FORWARD
FOR
SELECT s_date + s_start AS q_start,
s_date + s_stop + CASE WHEN s_start < s_stop THEN 0 ELSE 1 END AS q_end
FROM t_span
WHERE s_date BETWEEN @p_from - 1 AND @p_till
AND s_date + s_start >= @p_from
AND s_date + s_stop <= @p_till
ORDER BY
q_start
OPEN cr_span
FETCH NEXT
FROM cr_span
INTO @qs, @qe
SET @ms = @qs
SET @me = @qe
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT
FROM cr_span
INTO @qs, @qe
IF @qs > @me
BEGIN
INSERT
INTO @t
VALUES (@ms, @me)
SET @ms = @qs
END
SET @me = CASE WHEN @qe > @me THEN @qe ELSE @me END
END
IF @ms IS NOT NULL
BEGIN
INSERT
INTO @t
VALUES (@ms, @me)
END
CLOSE cr_span
RETURN
END
Since SQL Server
lacks an easy way to refer to previously selected rows in a resultset, this is one of rare cases when cursors in SQL Server
work faster than set-based decisions.
Tested on 1,440,000
rows, works for 24
seconds for the full set, and almost instant for a range of day or two.
Note the additional condition in the SELECT
query:
s_date BETWEEN @p_from - 1 AND @p_till
This seems to be redundant, but it is actually a coarse filter to make your index on s_date
usable.
Upvotes: 4
Reputation: 10411
To help answer the question, here is the sample data given in the question in a table variable like Hainstech used:
declare @T1 table (d1 datetime, d2 datetime)
insert @T1 (d1,d2)
select '02 June 2009 05:55:28','02 June 2009 09:58:27'
union all select '02 June 2009 10:15:19','02 June 2009 13:58:24'
union all select '02 June 2009 13:58:24','02 June 2009 13:58:43'
union all select '03 June 2009 05:54:48','03 June 2009 10:00:13'
union all select '03 June 2009 09:26:45','03 June 2009 09:59:40'
union all select '03 June 2009 10:12:50','03 June 2009 10:27:25'
union all select '03 June 2009 10:13:12','03 June 2009 11:14:56'
union all select '03 June 2009 10:27:25','03 June 2009 10:27:31'
union all select '03 June 2009 10:27:39','03 June 2009 13:53:38'
union all select '03 June 2009 11:14:56','03 June 2009 11:15:03'
union all select '03 June 2009 11:15:30','03 June 2009 14:02:14'
union all select '03 June 2009 13:53:38','03 June 2009 13:53:43'
union all select '03 June 2009 14:02:14','03 June 2009 14:02:31'
union all select '04 June 2009 05:48:27','04 June 2009 09:58:59'
union all select '04 June 2009 06:00:00','04 June 2009 09:59:07'
union all select '04 June 2009 10:15:52','04 June 2009 13:54:52'
union all select '04 June 2009 10:16:01','04 June 2009 13:24:20'
union all select '04 June 2009 13:24:20','04 June 2009 13:24:24'
union all select '04 June 2009 13:24:32','04 June 2009 14:00:39'
union all select '04 June 2009 13:54:52','04 June 2009 13:54:58'
union all select '04 June 2009 14:00:39','04 June 2009 14:00:49'
union all select '05 June 2009 05:53:58','05 June 2009 09:59:12'
union all select '05 June 2009 10:16:05','05 June 2009 13:59:08'
union all select '05 June 2009 13:59:08','05 June 2009 13:59:16'
union all select '06 June 2009 06:04:00','06 June 2009 10:00:00'
union all select '06 June 2009 10:16:54','06 June 2009 10:18:40'
union all select '06 June 2009 10:18:40','06 June 2009 10:18:45'
union all select '06 June 2009 10:23:00','06 June 2009 13:57:00'
union all select '06 June 2009 10:23:48','06 June 2009 13:57:54'
union all select '06 June 2009 13:57:21','06 June 2009 13:57:38'
union all select '06 June 2009 13:57:54','06 June 2009 13:57:58'
union all select '07 June 2009 21:59:30','07 June 2009 01:58:49'
union all select '07 June 2009 22:12:16','07 June 2009 01:58:39'
union all select '07 June 2009 22:12:25','07 June 2009 01:58:28'
union all select '08 June 2009 02:10:33','08 June 2009 05:56:11'
union all select '08 June 2009 02:10:43','08 June 2009 05:56:23'
union all select '08 June 2009 02:10:49','08 June 2009 05:55:59'
union all select '08 June 2009 05:55:59','08 June 2009 05:56:01'
union all select '08 June 2009 05:56:11','08 June 2009 05:56:14'
union all select '08 June 2009 05:56:23','08 June 2009 05:56:27'
Upvotes: 0
Reputation: 1912
Here is a recursive CTE solution, but I took the liberty of assigning a date and time to each column rather than pulling the date out separately. Helps to avoid some messy special case code. If you must store the date separately, I would use a view of CTE to make it look like two datetime columns and go with this approach.
create test data:
create table t1 (d1 datetime, d2 datetime)
insert t1 (d1,d2)
select '2009-06-03 10:00:00', '2009-06-03 14:00:00'
union all select '2009-06-03 13:55:00', '2009-06-03 18:00:00'
union all select '2009-06-03 17:55:00', '2009-06-03 23:00:00'
union all select '2009-06-03 22:55:00', '2009-06-04 03:00:00'
union all select '2009-06-04 03:05:00', '2009-06-04 07:00:00'
union all select '2009-06-04 07:05:00', '2009-06-04 10:00:00'
union all select '2009-06-04 09:55:00', '2009-06-04 14:00:00'
Recursive CTE:
;with dateRanges (ancestorD1, parentD1, d2, iter) as
(
--anchor is first level of collapse
select
d1 as ancestorD1,
d1 as parentD1,
d2,
cast(0 as int) as iter
from t1
--recurse as long as there is another range to fold in
union all select
tLeft.ancestorD1,
tRight.d1 as parentD1,
tRight.d2,
iter + 1 as iter
from dateRanges as tLeft join t1 as tRight
--join condition is that the t1 row can be consumed by the recursive row
on tLeft.d2 between tRight.d1 and tRight.d2
--exclude identical rows
and not (tLeft.parentD1 = tRight.d1 and tLeft.d2 = tRight.d2)
)
select
ranges1.*
from dateRanges as ranges1
where not exists (
select 1
from dateRanges as ranges2
where ranges1.ancestorD1 between ranges2.ancestorD1 and ranges2.d2
and ranges1.d2 between ranges2.ancestorD1 and ranges2.d2
and ranges2.iter > ranges1.iter
)
Gives output:
ancestorD1 parentD1 d2 iter
----------------------- ----------------------- ----------------------- -----------
2009-06-04 03:05:00.000 2009-06-04 03:05:00.000 2009-06-04 07:00:00.000 0
2009-06-04 07:05:00.000 2009-06-04 09:55:00.000 2009-06-04 14:00:00.000 1
2009-06-03 10:00:00.000 2009-06-03 22:55:00.000 2009-06-04 03:00:00.000 3
Upvotes: 1
Reputation: 47402
Here is a SQL only solution. I used DATETIME for the columns. Storing the time separate is a mistake in my opinion, as you will have problems when the times go past midnight. You can adjust this to handle that situation though if you need to. The solution also assumes that the start and end times are NOT NULL. Again, you can adjust as needed if that's not the case.
The general gist of the solution is to get all of the start times that don't overlap with any other spans, get all of the end times that don't overlap with any spans, then match the two together.
The results match your expected results except in one case, which checking by hand looks like you have a mistake in your expected output. On the 6th there should be a span that ends at 2009-06-06 10:18:45.000.
SELECT
ST.start_time,
ET.end_time
FROM
(
SELECT
T1.start_time
FROM
dbo.Test_Time_Spans T1
LEFT OUTER JOIN dbo.Test_Time_Spans T2 ON
T2.start_time < T1.start_time AND
T2.end_time >= T1.start_time
WHERE
T2.start_time IS NULL
) AS ST
INNER JOIN
(
SELECT
T3.end_time
FROM
dbo.Test_Time_Spans T3
LEFT OUTER JOIN dbo.Test_Time_Spans T4 ON
T4.end_time > T3.end_time AND
T4.start_time <= T3.end_time
WHERE
T4.start_time IS NULL
) AS ET ON
ET.end_time > ST.start_time
LEFT OUTER JOIN
(
SELECT
T5.end_time
FROM
dbo.Test_Time_Spans T5
LEFT OUTER JOIN dbo.Test_Time_Spans T6 ON
T6.end_time > T5.end_time AND
T6.start_time <= T5.end_time
WHERE
T6.start_time IS NULL
) AS ET2 ON
ET2.end_time > ST.start_time AND
ET2.end_time < ET.end_time
WHERE
ET2.end_time IS NULL
Upvotes: 8
Reputation: 57093
Similar question on SO here:
Min effective and termdate for contiguous dates
FWIW I up-voted the one that recommended Joe Celko's SQL For Smarties, Third Edition -- repeat: Third Edition (2005) -- which discusses various approaches, set base and procedural.
Upvotes: 3
Reputation: 5299
Assuming you:
Do the following:
first = first row in L
flat_date.start = first.start, flat_date.end = first.end
For each row in L:
if row.start < flat_date.end and row.end > flat_date.end: // adding on to a timespan
flat_date.end = row.end
else: // ending a timespan and starting a new one
add flat_date to F
flat_date.start = row.start, flat_date.end = row.end
add flat_date to F // adding the last timespan to the flattened list
Upvotes: 2